arXiv:1501.00676v1 [math.OC] 4 Jan 2015


A VARIATIONAL FORMULA FOR RISK-SENSITIVE REWARD


V. ANANTHARAM¹ and V. S. BORKAR²


ABSTRACT: We derive a variational formula for the optimal growth rate of reward in the infinite horizon risk-sensitive control problem for discrete time Markov decision processes with compact metric state and action spaces, extending a formula of Donsker and Varadhan for the Perron-Frobenius eigenvalue of a positive operator. This leads to a concave maximization formulation of the problem of determining this optimal growth rate.

Key words: risk-sensitive control; Perron-Frobenius eigenvalue; positive 
operators; variational formula 


¹ EECS Department, University of California, Berkeley, CA 94720, USA. Research supported in part by the ARO MURI grant W911NF-08-1-0233, Tools for the Analysis and Design of Complex Multi-Scale Networks, the NSF grants CNS-0910702 and ECCS-1343398, and the NSF Science & Technology Center grant CCF-0939370, Science of Information. A part of this work was done while this author was visiting IIT Bombay.

² Department of Elec. Engg., IIT Bombay, Powai, Mumbai 400076, India. Work supported in part by a J. C. Bose Fellowship and grant 11IRCCSG014 from IIT Bombay. A part of this work was done while this author was visiting the University of California, Berkeley.





1 Introduction 


Infinite time horizon risk-sensitive control seeks to maximize the asymptotic growth rate for mean multiplicative reward in the standard Markov decision theory setting. The optimal reward multiplier per step turns out to be the Perron-Frobenius eigenvalue of a positive 1-homogeneous nonlinear operator. The existence of this Perron-Frobenius eigenvalue and an associated eigenfunction is ensured by the nonlinear Krein-Rutman theorem of [37, Theorem 3.1.1 and Proposition 3.1.5] under suitable conditions (see also [36], [33], [32], [12], [3]). Our aim here is to build on this nonlinear Krein-Rutman theorem to provide a variational formula for the optimal growth rate of reward in the spirit of the Donsker-Varadhan formula for the Perron-Frobenius eigenvalue of a nonnegative matrix [15, section 3.1.2], [18], [22].

Risk-sensitive control has traditionally been studied in the framework of cost minimization, see e.g. [16], [26], [27] for recent work on general state space models and [20], [21] for its discrete state space precursors. Work on risk-sensitive reward maximization has been relatively uncommon, see e.g. [28]. Unlike in the case of the classical discounted or ergodic costs, the two risk-sensitive control problems are not trivially equivalent by treating cost as a negative reward. In fact, risk-sensitive reward maximization is the natural set-up in portfolio optimization, see e.g. [TO]. Nevertheless, it has been commonplace to replace it by risk-sensitive cost minimization so as to exploit the vastly more abundant available machinery for the latter problem, see, e.g., equation (18) of [6]. Interestingly, our approach is tailored for the risk-sensitive reward maximization problem.

The paper is organized as follows. This section presents the basic notation and control-theoretic framework. In section 2 we develop the role of the nonlinear Krein-Rutman theorem in giving an expression for the optimal reward multiplier per stage. In section 3 this is parlayed into a variational expression for the optimal growth rate of reward. Theorem 4 in section 3 is the main result of this paper. Alternative variational formulations derived from the primary one are discussed in section 4; each of these provides a different kind of insight into how to think about the optimal growth rate of reward. Some examples are worked out in section 5 to illustrate the nature of the results. We close the paper with some concluding remarks in section 6.

We turn next to introducing our notation and the control-theoretic framework. For a compact metric space $\mathcal{X}$, $\mathcal{M}(\mathcal{X})$ and $\mathcal{P}(\mathcal{X})$ will denote respectively the space of finite (signed) Borel measures on $\mathcal{X}$ and the space of probability measures on $\mathcal{X}$, both with the topology of weak convergence [9]. $C(\mathcal{X})$ will denote the Banach space of continuous maps $\mathcal{X} \mapsto \mathbb{R}$ with the supremum norm, denoted by $\|\cdot\|$. Thus $\mathcal{M}(\mathcal{X})$ is the dual Banach space of $C(\mathcal{X})$, with the weak-* topology [38]. Let $S$ be a prescribed compact metric space called the state space and $U$ another compact metric space, called the action space. We shall consider an $S$-valued controlled Markov process $(X_n, n \ge 0)$ controlled by a $U$-valued control process $(Z_n, n \ge 0)$ defined as follows. Consider a complete probability space $(\Omega, \mathcal{F}, P)$ where $\Omega := (S \times U)^\infty$, and $\mathcal{F}$ is its product Borel $\sigma$-field. For $\omega = [(\omega_0, \omega_0'), (\omega_1, \omega_1'), (\omega_2, \omega_2'), \ldots] \in \Omega$ with $\omega_i \in S$ and $\omega_i' \in U$ $\forall i$, define 'canonical' random variables $X_i = \omega_i$, $Z_i = \omega_i'$, $i \ge 0$. The probability measure $P$ on $(\Omega, \mathcal{F})$ is then the law of $((X_n, Z_n), n \ge 0)$ defined as follows. The law of $X_0$ is prescribed and the law of $((X_n, Z_n), n \ge 0)$ is constructed inductively. For this purpose, define two increasing families of sub-$\sigma$-fields of $\mathcal{F}$: $\mathcal{F}_n^- := \sigma(X_m, m \le n; Z_m, m < n)$ and $\mathcal{F}_n := \sigma(X_m, m \le n; Z_m, m \le n)$ for $n \ge 0$. First define the conditional law of $Z_0$ given $\mathcal{F}_0^-$ as $\phi_0(du|X_0)$, where

$$\phi_0(du|x_0) : S \mapsto \mathcal{P}(U)$$

is a prescribed kernel, i.e. $\phi_0(du|x)$ is a probability distribution in $\mathcal{P}(U)$ for all $x$ and $\phi_0(A|x)$ is Borel measurable in $x$ for all Borel subsets $A \subset U$. Let $P_n$ denote the law of $((X_0, Z_0), (X_1, Z_1), \ldots, (X_n, Z_n))$, defined as a probability measure on $(\Omega, \mathcal{F}_n)$, starting with $n = 0$. Define the law of $X_{n+1}$ given $\mathcal{F}_n$ as $p(dy|X_n, Z_n)$ where

$$p(dy|x,u) : S \times U \mapsto \mathcal{P}(S)$$

is a prescribed kernel, i.e. $p(dy|x,u)$ is a probability distribution in $\mathcal{P}(S)$ for all $(x,u) \in S \times U$ and $p(A|x,u)$ is Borel measurable in $(x,u)$ for all Borel subsets $A \subset S$. Define the conditional law of $Z_{n+1}$ given $\mathcal{F}_{n+1}^-$ as

$$\phi_{n+1}(du \,|\, (X_0, Z_0), \ldots, (X_n, Z_n), X_{n+1})$$

where

$$\phi_{n+1}(du \,|\, (x_0,u_0), \ldots, (x_n,u_n), x_{n+1}) : (S \times U)^{n+1} \times S \mapsto \mathcal{P}(U)$$

is a prescribed kernel for each $n$. These together define $P_{n+1}$. By the Ionescu-Tulcea theorem (p. 101, [35]), we define a unique $P$ on $(\Omega, \mathcal{F})$. By construction, for all Borel $A \subset S$,

$$P(X_{n+1} \in A \,|\, \mathcal{F}_n) = P(X_{n+1} \in A \,|\, X_n, Z_n) = p(A|X_n, Z_n). \tag{1}$$

The $(Z_n, n \ge 0)$ constructed above will be referred to as admissible controls. We shall also consider two special classes of admissible controls: stationary Markov controls of the form

$$Z_n = v(X_n) \quad \forall n,$$

for some measurable $v : S \mapsto U$, and randomized stationary Markov controls satisfying

$$P(Z_n \in A \,|\, \mathcal{F}_n^-) = P(Z_n \in A \,|\, X_n) = \varphi(A|X_n) \quad \forall n, \ \forall \text{ Borel } A \subset U,$$

for some kernel $\varphi(du|x) : S \mapsto \mathcal{P}(U)$. By a standard abuse of terminology, we identify these with the maps $v(\cdot)$, $\varphi(\cdot|\cdot)$ resp. The sets thereof will be denoted by SM and RM respectively. We view SM as a subset of RM by identifying $v(\cdot)$ with $\delta_{v(\cdot)}$, the Dirac measure at $v(\cdot)$.


The infinite horizon risk-sensitive reward we seek to characterize is

$$\lambda := \sup_{x \in S} \, \sup \, \liminf_{N \uparrow \infty} \frac{1}{N} \log E\left[ e^{\sum_{m=0}^{N-1} r(X_m, Z_m, X_{m+1})} \,\Big|\, X_0 = x \right], \tag{2}$$

where the second supremum is over all admissible controls. Here $r(x,u,y)$ is an extended-real-valued function on $S \times U \times S$, called the 'per stage reward' on transitioning from $x$ to $y$ under action $u$. It should be noted that we will allow $e^{r(x,u,y)} = 0$ for some $(x,u,y)$, so $r(x,u,y)$ should be thought of as being allowed to take the extended real value $-\infty$.

Throughout the paper, we make the following assumptions about $r(x,u,y)$ and $p(dy|x,u)$. We will occasionally explicitly recall these assumptions to remind the reader of this.


(A0): $e^{r(x,u,y)} \in C(S \times U \times S)$.

(A1): The maps $(x,u) \mapsto \int f(y) p(dy|x,u)$, $f \in C(S)$ with $\|f\| \le 1$, are equicontinuous. This is true, e.g., if $S$ is a compact metric space, $U$ is a compact metric space, and $p(dy|x,u) = \psi(y|x,u)\varphi(dy)$ with $\varphi \in \mathcal{P}(S)$ having full support and $\psi(y|\cdot,\cdot)$, $y \in S$, equicontinuous.


We shall denote by $e^{r_M}$ the least upper bound for $e^{r(x,u,y)}$, which is finite by virtue of assumption (A0).

Towards the end of the next section we will build up to the main variational formula by first considering the case where we have additional restrictions captured by the following assumptions.

(A0+): Condition (A0) holds and we have $e^{r(x,u,y)} > 0$ for all $(x,u,y)$.

(A1+): Condition (A1) holds and $p(dy|x,u)$ has full support for all $x,u$. For instance, if $S$ is a compact metric space, $U$ is a compact metric space, and $p(dy|x,u) = \psi(y|x,u)\varphi(dy)$ as above with $\psi(y|\cdot,\cdot)$, $y \in S$, equicontinuous, then $\psi(\cdot|x,u) > 0$ on $S$ will ensure that this assumption holds.

We shall denote by $e^{r_m} > 0$ the greatest lower bound for $e^{r(x,u,y)}$ when (A0+) holds.

If $p(dx)$ and $q(dx)$ are finite nonnegative Borel measures on a compact metric space $\mathcal{X}$, we write $D(p(dx)\|q(dx))$ for the relative entropy of $p(dx)$ with respect to $q(dx)$, defined by

$$D(p(dx)\|q(dx)) := \begin{cases} \int p(dx) \log l(x) & \text{if we can write } p(dx) = l(x) q(dx), \\ \infty & \text{otherwise.} \end{cases}$$

See e.g. [41] for some of the basic properties of relative entropy.


2 The Perron-Frobenius eigenvalue 


Let assumptions (A0) and (A1) be in force. Define the operator $T : C(S) \mapsto C(S)$ by

$$Tf(x) := \sup_{\phi \in \mathcal{P}(U)} \int\int p(dy|x,u)\,\phi(du)\, e^{r(x,u,y)} f(y). \tag{3}$$

For fixed $x \in S$ on the left hand side of (3), the supremum on the right hand side is that of a continuous affine function of $\phi$ over a compact convex set of probability measures. Hence, it is a maximum attained at a Dirac measure. For each fixed $f \in C(S)$, a standard measurable selection theorem [5, Lemma 1, p. 182] allows us to choose the family of maximizers, parametrized by $x \in S$, as a measurable function $v : S \mapsto U$. To see that $T$ is a map $C(S) \mapsto C(S)$, note that for $f \in C(S)$ with $\|f\| \le R$,


$$\begin{aligned}
|Tf(x) - Tf(x')| &= \Big| \sup_{\phi} \int\int p(dy|x,u)\phi(du) e^{r(x,u,y)} f(y) - \sup_{\phi} \int\int p(dy|x',u)\phi(du) e^{r(x',u,y)} f(y) \Big| \\
&= \Big| \sup_{u} \int p(dy|x,u) e^{r(x,u,y)} f(y) - \sup_{u} \int p(dy|x',u) e^{r(x',u,y)} f(y) \Big| \\
&\le e^{r_M} \sup_{u} \sup_{f : \|f\| \le R} \Big| \int p(dy|x,u) f(y) - \int p(dy|x',u) f(y) \Big| + R \max_{u,y} \big| e^{r(x,u,y)} - e^{r(x',u,y)} \big| .
\end{aligned}$$

As $x \to x'$, the first term on the right tends to zero by (A1) and the second term on the right tends to zero by uniform continuity of $e^{r}$, being a continuous function defined on a compact set, by (A0). In fact, this shows that $Tf$, $\|f\| \le R$, are equicontinuous and bounded. Also, from the definition of $T$, it is straightforward to check that

$$\|Tf - Tg\| \le e^{r_M} \|f - g\|,$$

which establishes $T$ as a continuous (in fact, Lipschitz) map $C(S) \mapsto C(S)$.


Likewise, define, for $f \in C(S)$,

$$T^{(n)} f(x) := \sup E\left[ e^{\sum_{m=0}^{n-1} r(X_m, Z_m, X_{m+1})} f(X_n) \,\Big|\, X_0 = x \right],$$

where the supremum is over all admissible control processes. Then $T^{(1)} = T$, by virtue of the measurable selection theorem alluded to after (3). We use the convention $T^{(0)} :=$ the identity map.


Lemma 1. $(T^{(n)}, n \ge 0)$ is a semigroup of operators on $C(S)$. □

Proof of Lemma 1. Note that we need to verify that for $n \ge 2$, $T^{(n)}$ maps $C(S)$ to $C(S)$ as part of the stated claim. This follows as a corollary of the proof, which establishes that $T^{(n)}$ is the $n$-fold composition of $T$ with itself. The proof follows by a standard dynamic programming argument. Specifically, we first have

$$\begin{aligned}
T^{(n)} f(x) &= \sup E\left[ e^{\sum_{m=0}^{n-1} r(X_m, Z_m, X_{m+1})} f(X_n) \,\Big|\, X_0 = x \right] \\
&\le \sup E\left[ e^{r(X_0, Z_0, X_1)} \sup E\left[ e^{\sum_{m=1}^{n-1} r(X_m, Z_m, X_{m+1})} f(X_n) \,\Big|\, X_0, Z_0, X_1 \right] \Big|\, X_0 = x \right] \\
&= \sup E\left[ e^{r(X_0, Z_0, X_1)} \, T^{(n-1)} f(X_1) \,\Big|\, X_0 = x \right],
\end{aligned} \tag{4}$$

where the inner supremum in the second line is over the control sequence from time 1 onwards, conditioned on $X_0 = x_0$, $Z_0 = z_0$, $X_1 = x_1$ (say). Secondly, let $\epsilon > 0$. By [TUI Lemma 1, p. 55], conditioned on $(X_0, Z_0, X_1)$, there exists an admissible state-control sequence $(X_m', Z_m')$, $m \ge 1$, with $X_1' = X_1$ such that

$$E\left[ e^{\sum_{m=1}^{n-1} r(X_m', Z_m', X_{m+1}')} f(X_n') \,\Big|\, X_1' \right] \ge \sup E\left[ e^{\sum_{m=1}^{n-1} r(X_m, Z_m, X_{m+1})} f(X_n) \,\Big|\, X_1 \right] - \epsilon .$$

Let $X_0' = X_0 = x$, $Z_0' := \mathrm{argmax}\left( \int p(dy|x,\cdot)\, e^{r(x,\cdot,y)}\, T^{(n-1)} f(y) \right)$. Then $(X_m', Z_m')$, $m \ge 0$, is an admissible state-control sequence and

$$\begin{aligned}
T^{(n)} f(x) &\ge E\left[ e^{\sum_{m=0}^{n-1} r(X_m', Z_m', X_{m+1}')} f(X_n') \,\Big|\, X_0' = x \right] \\
&\ge E\left[ e^{r(X_0', Z_0', X_1')} \sup E\left[ e^{\sum_{m=1}^{n-1} r(X_m, Z_m, X_{m+1})} f(X_n) \,\Big|\, X_1 \right] \Big|\, X_0' = x \right] - \epsilon e^{r_M} \\
&= E\left[ e^{r(X_0', Z_0', X_1')} \, T^{(n-1)} f(X_1') \,\Big|\, X_0' = x \right] - \epsilon e^{r_M} \\
&= \left( T\, T^{(n-1)} f \right)(x) - \epsilon e^{r_M} .
\end{aligned} \tag{5}$$

Combining (4), (5) and using the fact that $\epsilon > 0$ was arbitrary, it follows that $T^{(n)} = T \circ T^{(n-1)}$. A similar argument shows that $T^{(n)} f = T^{(n-1)} \circ T f$.

□
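The dynamic programming argument can be checked numerically on a small finite model. The sketch below is only an illustration under assumptions that are not part of the paper: finite state and action sets (a special case of compact metric spaces), a randomly generated kernel `p` and reward `r`, and the hypothetical helper names `T` and `T2_brute_force`. It compares the two-step operator, computed by brute force over deterministic history-dependent controls, with the composition of T with itself. Randomized controls are omitted because the objective is linear in the randomization, so they cannot improve the maximum.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 3, 2

# Hypothetical finite model: p[x, u, y] = p(y|x,u), r[x, u, y] = r(x,u,y).
p = rng.random((n_states, n_actions, n_states))
p /= p.sum(axis=2, keepdims=True)
r = rng.normal(size=(n_states, n_actions, n_states))
K = p * np.exp(r)  # K[x, u, y] = p(y|x,u) * exp(r(x,u,y))

def T(f):
    """One-step operator: maximize over the action u at each state x."""
    return (K @ f).max(axis=1)

def T2_brute_force(f, x):
    """Two-step value from state x, enumerating the first action u0 and
    every deterministic rule u1 mapping X_1 to the second action."""
    best = -np.inf
    for u0 in range(n_actions):
        for u1 in itertools.product(range(n_actions), repeat=n_states):
            continuation = np.array([K[x1, u1[x1]] @ f for x1 in range(n_states)])
            best = max(best, float(K[x, u0] @ continuation))
    return best

f = rng.random(n_states)
brute = np.array([T2_brute_force(f, x) for x in range(n_states)])
composed = T(T(f))
print(brute, composed)
```

The two arrays agree because the nonnegative weights `K[x, u0]` let the inner maximization be carried out separately for each value of the next state, which is the content of the inequalities (4) and (5).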


The semigroup $(T^{(n)}, n \ge 0)$ is precisely the discrete time counterpart of the Nisio semigroup [35].

Let $C^+(S) := \{f \in C(S) : f(x) \ge 0 \ \forall x\}$ denote the set of nonnegative functions in $C(S)$. Then $C^+(S)$ is a cone, i.e. it is closed under addition and scalar multiplication by nonnegative real numbers, and we have $C^+(S) \cap (-C^+(S)) = \{\theta\}$ where $\theta$ denotes the constant function that is identically zero. Thus $C^+(S)$ defines a partial order on $C(S)$, denoted $\ge$, given by $f \ge g$ if $f - g \in C^+(S)$. We write $f > g$ (equivalently, $g < f$) if $f \ge g$, $f \ne g$, and we write $f \gg g$ if $f - g$ is a strictly positive function in $C(S)$ or equivalently if $f - g \in \mathrm{int}(C^+(S))$, where $\mathrm{int}(C^+(S))$ denotes the interior of $C^+(S)$. The dual cone of $C^+(S)$ is the cone in the dual Banach space $\mathcal{M}(S)$ given by $\{\mu \in \mathcal{M}(S) : \int f d\mu \ge 0 \ \forall f \in C^+(S)\}$. This is the set of finite nonnegative measures on $S$, which we denote by $\mathcal{M}^+(S)$. For more on cones in Banach spaces, see [2].

Let us now make the additional assumptions (A0+) and (A1+). One can then verify the following additional properties of $T^{(n)}$ for each $n \ge 1$.

1. $T^{(n)}$ is strictly increasing, i.e., $f < g$ implies $T^{(n)} f < T^{(n)} g$. In view of the fact established above that $(T^{(n)}, n \ge 0)$ is a semigroup, it suffices to prove this claim for $n = 1$. We know that there is a measurable function $v : S \mapsto U$ such that

$$Tf(x) = \int p(dy|x,v(x))\, e^{r(x,v(x),y)} f(y).$$

Then

$$Tg(x) - Tf(x) \ge \int p(dy|x,v(x))\, e^{r(x,v(x),y)} g(y) - \int p(dy|x,v(x))\, e^{r(x,v(x),y)} f(y) > 0,$$

because $f \le g$, $f \ne g$ and support$(p(dy|x,u)) = S$ $\forall x,u$.

2. $T^{(n)}$ is strongly positive, i.e., $f \in C^+(S)$, $f \ne \theta \Longrightarrow T^{(n)} f \in \mathrm{int}(C^+(S))$. This follows from the fact that for any $u_0 \in U$,

$$T^{(n)} f(x) \ge e^{n r_m} \int p(dy|x,u_0) f(y) > 0,$$

where we use the fact that support$(p(dy|x,u_0)) = S$.

3. $T^{(n)}$ is positively one-homogeneous, i.e., for $c > 0$, $T^{(n)}(cf) = c\,T^{(n)} f$. (This holds under the weaker assumptions (A0) and (A1).)

4. For $M > e^{-n r_m}$ and $f \in C(S)$ defined by $f(\cdot) \equiv 1$, $M T^{(n)} f > f$.

5. $T^{(n)}$ is compact. (This holds under the weaker assumptions (A0) and (A1).) It suffices to verify this for $n = 1$, the general case being then a consequence of the semigroup property. By (A1), the family $x \mapsto F_f(x,u) := \int f(y) e^{r(x,u,y)} p(dy|x,u)$, $u \in U$, $\|f\| \le R$, is equicontinuous and bounded in $C(S)$-norm by $e^{r_M} R$. Hence it is relatively compact in $C(S)$ by the Arzela-Ascoli theorem. Let $\delta \in [0,1] \mapsto w_\delta(\cdot)$ denote its common modulus of continuity relative to a compatible metric $\kappa$ on $S$. Then $T : C(S) \mapsto C(S)$ satisfies $\|Tf\| \le e^{r_M} R$ for $\|f\| \le R$, $f \in C(S)$, and,

$$\begin{aligned}
\sup_{x,y \in S,\, \kappa(x,y) < \delta} \|Tf(x) - Tf(y)\| &\le \sup_{x,y \in S,\, \kappa(x,y) < \delta} \big\| \sup_u F_f(x,u) - \sup_u F_f(y,u) \big\| \\
&\le \sup_{x,y \in S,\, \kappa(x,y) < \delta} \sup_u \|F_f(x,u) - F_f(y,u)\| \\
&\le w_\delta(F_f) \to 0 \quad \text{as } \delta \to 0,
\end{aligned}$$

uniformly in $f : \|f\| \le R$. Thus $Tf$, $\|f\| \le R$, is equicontinuous. By the Arzela-Ascoli theorem, it is relatively compact, implying that $T : C(S) \mapsto C(S)$ is a compact operator.

The preceding considerations allow us to state the following theorem.

Theorem 1. Under the assumptions (A0+) and (A1+), there exists a unique $\rho > 0$ (the Perron-Frobenius eigenvalue) and $\psi \in \mathrm{int}(C^+(S))$ such that $T\psi = \rho\psi$, i.e.,

$$\rho\psi(x) = \sup_{\phi \in \mathcal{P}(U)} \int\int p(dy|x,u)\,\phi(du)\, e^{r(x,u,y)} \psi(y), \tag{6}$$

with $\rho$ given by

$$\rho = \inf_{f \in \mathrm{int}(C^+(S))} \sup_{\mu \in \mathcal{M}^+(S)} \frac{\int Tf\, d\mu}{\int f\, d\mu} = \sup_{f \in \mathrm{int}(C^+(S))} \inf_{\mu \in \mathcal{M}^+(S)} \frac{\int Tf\, d\mu}{\int f\, d\mu}. \tag{7}$$

□


Equation (7) is an abstract version of the celebrated Collatz-Wielandt formula for the Perron-Frobenius eigenvalue of irreducible nonnegative matrices, see e.g. [34].
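For a finite model the two sides of the Collatz-Wielandt formula are easy to probe numerically, since the supremum and infimum over nonnegative measures reduce to the maximum and minimum over states of the ratio Tf(x)/f(x). The following sketch rests on assumptions that are not in the paper: finite state and action sets, a randomly generated strictly positive kernel and finite rewards (so that (A0+) and (A1+) hold), and a normalized power iteration (hypothetical helper `perron_eigenpair`) used as a convenient way to approximate the eigenpair.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 3

# Hypothetical model: strictly positive kernel and finite rewards.
p = rng.random((n_states, n_actions, n_states)) + 0.1
p /= p.sum(axis=2, keepdims=True)
r = rng.normal(size=(n_states, n_actions, n_states))
K = p * np.exp(r)  # K[x, u, y] = p(y|x,u) * exp(r(x,u,y))

def T(f):
    """Finite-state version of the operator T: maximize over actions."""
    return (K @ f).max(axis=1)

def perron_eigenpair(iters=2000):
    """Normalized power iteration; returns estimates of (rho, psi)."""
    psi = np.ones(n_states)
    for _ in range(iters):
        psi = T(psi)
        psi /= psi.max()
    return float((T(psi) / psi).mean()), psi

rho, psi = perron_eigenpair()

def collatz_wielandt_bounds(f):
    """For strictly positive f, return (min_x Tf/f, max_x Tf/f)."""
    ratio = T(f) / f
    return float(ratio.min()), float(ratio.max())

bounds = [collatz_wielandt_bounds(rng.random(n_states) + 0.05) for _ in range(5)]
print(rho, bounds)
```

Every strictly positive trial function brackets the eigenvalue from below and above, and the bracket collapses to a point exactly at the eigenfunction.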


Before proceeding to the proof of Theorem 1 it is appropriate to make a few remarks. A great deal is known about analogs of the Perron-Frobenius theorem for increasing positively one-homogeneous maps on finite dimensional vector spaces, see the recent book [30]. When the map is on an ordered Banach space (and one talks about a Krein-Rutman theorem rather than a Perron-Frobenius theorem, in view of the seminal work in [29]), we rely on Theorem 3.1.1, Proposition 3.1.5, and Lemma 3.1.7 of [37], as seen in the proof below (see also [36], [33]). These results in [37] are themselves stated in a much broader context than the special case of the Banach space $C(S)$ and the order structure defined by the cone $C^+(S)$, with $S$ a compact metric space, which suffices for our purposes. The recent papers [32] and [12] claim even stronger nonlinear Krein-Rutman theorems. However, it has been recognized in [3] that some of the claims in these papers are wrong. The proof of Theorem 1 given below does not rely in any way on [32], [12], or [3].


Proof of Theorem 1. We define

$$\|T^{(n)}\|_+ := \sup\{\|T^{(n)} f\| : f \in C^+(S), \ \|f\| \le 1\}, \quad n \ge 0.$$

Since $(T^{(n)}, n \ge 0)$ is a positive semigroup, it is straightforward to check that $\|T^{(k+l)}\|_+ \le \|T^{(k)}\|_+ \|T^{(l)}\|_+$ for all $k, l \ge 0$, and so

$$r(T) := \lim_{n \to \infty} \|T^{(n)}\|_+^{1/n}$$

exists. By the fourth of the properties of the semigroup $(T^{(n)}, n \ge 0)$ shown above, we have $r(T) > 0$. It will turn out that the $\rho$ promised in the statement of Theorem 1 is just $r(T)$.

Strong positivity of $T$, which was shown above, verifies assumption A4 in [37, pg. 47], and the facts that $T$ is compact (as established above), one-homogeneous, and order preserving are respectively the conditions A1, A2, and A3 in [37, pg. 47]. Thus [37, Proposition 3.1.5] provides the additional requirement in the statement of [37, Theorem 3.1.1] that $T$ have an eigenvalue, and [37, Theorem 3.1.1] states that with $\rho$ taken to be $r(T)$ there exists a $\psi \in \mathrm{int}(C^+(S))$ such that (6) holds.

It remains to establish (7), where we now know that $\rho = r(T)$. We have

$$\rho \ge \inf_{f \in \mathrm{int}(C^+(S))} \sup_{\mu \in \mathcal{M}^+(S)} \frac{\int Tf\, d\mu}{\int f\, d\mu},$$

which comes from substituting $\psi$ as a choice for $f$ on the right hand side. Similarly, we have

$$\rho \le \sup_{f \in \mathrm{int}(C^+(S))} \inf_{\mu \in \mathcal{M}^+(S)} \frac{\int Tf\, d\mu}{\int f\, d\mu}.$$

Thus it suffices to establish

$$\inf_{f \in \mathrm{int}(C^+(S))} \sup_{\mu \in \mathcal{M}^+(S)} \frac{\int Tf\, d\mu}{\int f\, d\mu} \ge \rho \ge \sup_{f \in \mathrm{int}(C^+(S))} \inf_{\mu \in \mathcal{M}^+(S)} \frac{\int Tf\, d\mu}{\int f\, d\mu}. \tag{8}$$

Given $f \in \mathrm{int}(C^+(S))$, we have

$$Tf \le \left( \sup_{\mu \in \mathcal{M}^+(S)} \frac{\int Tf\, d\mu}{\int f\, d\mu} \right) f .$$

From [37, Lemma 3.1.7 (ii)], we have $r(T) \le \sup_{\mu \in \mathcal{M}^+(S)} \frac{\int Tf\, d\mu}{\int f\, d\mu}$. Since this holds for all $f \in \mathrm{int}(C^+(S))$, this establishes the first inequality in (8). The proof of the second inequality in (8) is similar, based on [37, Lemma 3.1.7 (iii)]. This concludes the proof of Theorem 1. □


Next we show that $\log\rho$ is in fact the optimal growth rate of the risk-sensitive reward. For a development of the analogous result in the case of controlled diffusion processes, see [1]. As argued earlier, in connection with the right hand side of (3), for each $x \in S$, the supremum on the right hand side of (6) is that of a continuous affine function over a compact convex set of probability measures, and is therefore a maximum attained at a Dirac measure. A standard measurable selection theorem [5, Lemma 1, p. 182] then allows us to identify the family of maximizers, parametrized by $x \in S$, with an element of SM, which we denote by $v^*(\cdot)$. Letting $(X_n^*, n \ge 0)$ denote the chain governed by the stationary Markov strategy $v^*(\cdot)$ and $(Z_n^* = v^*(X_n^*), n \ge 0)$ the corresponding control sequence, we then have

$$\rho\psi(x) = E\left[ e^{r(X_0^*, Z_0^*, X_1^*)} \psi(X_1^*) \,\Big|\, X_0^* = x \right],$$

and, more generally, by iterating, we have, for all $x \in S$,

$$\rho^n \psi(x) = E\left[ e^{\sum_{m=0}^{n-1} r(X_m^*, Z_m^*, X_{m+1}^*)} \psi(X_n^*) \,\Big|\, X_0^* = x \right].$$

Since $\psi \in \mathrm{int}(C^+(S))$, we have $0 < c \le \psi(x) \le C < \infty$ for some constants $c, C$ when $\psi$ is chosen with, say, $\|\psi\| = 1$. Thus, for all $x \in S$,

$$\frac{c}{C}\, E\left[ e^{\sum_{m=0}^{n-1} r(X_m^*, Z_m^*, X_{m+1}^*)} \,\Big|\, X_0^* = x \right] \le \rho^n \le \frac{C}{c}\, E\left[ e^{\sum_{m=0}^{n-1} r(X_m^*, Z_m^*, X_{m+1}^*)} \,\Big|\, X_0^* = x \right].$$

Hence

$$\log\rho = \lim_{n \uparrow \infty} \frac{1}{n} \log E\left[ e^{\sum_{m=0}^{n-1} r(X_m^*, Z_m^*, X_{m+1}^*)} \,\Big|\, X_0^* = x \right].$$

For any other admissible state-control sequence $((X_n, Z_n), n \ge 0)$, the supremum in (6) dominates the value obtained under the given control, so that

$$\rho\psi(x) \ge E\left[ e^{r(x, Z_0, X_1)} \psi(X_1) \,\Big|\, X_0 = x \right].$$

Iterating,

$$\rho^n \psi(x) \ge E\left[ e^{\sum_{m=0}^{n-1} r(X_m, Z_m, X_{m+1})} \psi(X_n) \,\Big|\, X_0 = x \right],$$

and therefore

$$\log\rho \ge \liminf_{n \uparrow \infty} \frac{1}{n} \log E\left[ e^{\sum_{m=0}^{n-1} r(X_m, Z_m, X_{m+1})} \,\Big|\, X_0 = x \right].$$
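The sandwich bound involving the constants c and C gives an explicit rate: the logarithm of the optimal n-step reward differs from n log ρ by at most log(C/c), uniformly in the initial state. The sketch below checks this on a finite model. It is an illustration under assumptions that are not in the paper: finite state and action sets, randomly generated data satisfying (A0+) and (A1+), a power iteration to approximate the eigenpair, and the hypothetical helper `log_optimal_reward`, which evaluates the n-fold iterate of T on the constant function 1 in the log domain.

```python
import numpy as np

rng = np.random.default_rng(3)
n_states, n_actions = 4, 3

# Hypothetical model with strictly positive kernel and finite rewards.
p = rng.random((n_states, n_actions, n_states)) + 0.1
p /= p.sum(axis=2, keepdims=True)
K = p * np.exp(rng.normal(size=(n_states, n_actions, n_states)))

def T(f):
    return (K @ f).max(axis=1)

# Approximate eigenpair (rho, psi) by normalized power iteration.
psi = np.ones(n_states)
for _ in range(2000):
    psi = T(psi)
    psi /= psi.max()
log_rho = float(np.log((T(psi) / psi).mean()))
c, C = float(psi.min()), float(psi.max())

def log_optimal_reward(n):
    """log of the n-fold iterate of T applied to the constant 1, per state.
    One-homogeneity of T lets the scale be pulled out at every step,
    which avoids floating point overflow for large n."""
    f, log_scale = np.ones(n_states), 0.0
    for _ in range(n):
        f = T(f)
        s = f.max()
        f, log_scale = f / s, log_scale + float(np.log(s))
    return log_scale + np.log(f)

horizons = [10, 100, 1000]
gaps = [float(np.abs(log_optimal_reward(n) / n - log_rho).max()) for n in horizons]
print(log_rho, gaps)
```

The gap shrinks like 1/n, as the bound predicts, regardless of the initial state.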


We have proved:

Theorem 2. Under the assumptions (A0+) and (A1+), we have, for all $x \in S$,

$$\log\rho = \sup \liminf_{n \uparrow \infty} \frac{1}{n} \log E\left[ e^{\sum_{m=0}^{n-1} r(X_m, Z_m, X_{m+1})} \,\Big|\, X_0 = x \right],$$

where the supremum on the right is over all admissible controls and $\rho$ on the left is given as in Theorem 1. Furthermore, this supremum is a maximum attained at some $v^*(\cdot) \in$ SM. □

An immediate consequence is the following. 

Corollary 1. Under the assumptions (A0+) and (A1+) we have

$$\lambda = \log\rho,$$

where $\lambda$ is the optimal growth rate of reward, as defined in (2), and $\rho$ is as defined in Theorem 1. □
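Since the optimum is attained by a stationary Markov control, a finite model allows a direct cross-check: under a fixed deterministic stationary policy the problem is linear, and its growth rate is the logarithm of the spectral radius of the corresponding nonnegative matrix. The sketch below enumerates all such policies and compares the best one with log ρ obtained from the nonlinear eigenvalue problem. Everything specific here is an assumption for illustration: the finite randomly generated model, the power iteration, and the helper names.

```python
import itertools
import numpy as np

rng = np.random.default_rng(4)
n_states, n_actions = 4, 3

# Hypothetical model with strictly positive kernel and finite rewards.
p = rng.random((n_states, n_actions, n_states)) + 0.1
p /= p.sum(axis=2, keepdims=True)
K = p * np.exp(rng.normal(size=(n_states, n_actions, n_states)))

def T(f):
    return (K @ f).max(axis=1)

# Nonlinear Perron-Frobenius eigenpair by normalized power iteration.
psi = np.ones(n_states)
for _ in range(2000):
    psi = T(psi)
    psi /= psi.max()
log_rho = float(np.log((T(psi) / psi).mean()))

def spectral_radius(M):
    return float(np.max(np.abs(np.linalg.eigvals(M))))

# Growth rate of each deterministic stationary policy v: x -> v[x].
states = np.arange(n_states)
policy_values = {
    v: float(np.log(spectral_radius(K[states, list(v), :])))
    for v in itertools.product(range(n_actions), repeat=n_states)
}
best_value = max(policy_values.values())

# Policy that attains the maximum in the eigenvalue equation.
greedy_policy = tuple(int(u) for u in (K @ psi).argmax(axis=1))
print(log_rho, best_value, greedy_policy)
```

The policy read off from the eigenfunction is among the maximizers, which is the finite-state shadow of the statement that the supremum is attained in SM.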


3 A variational formula 

By (7), we have

$$\begin{aligned}
\rho &= \inf_{f \gg 0} \ \sup_{\mu \in \mathcal{M}^+(S) : \int f d\mu = 1} \int \mu(dx) \sup_u \int p(dy|x,u)\, e^{r(x,u,y)} f(y) \\
&= \inf_{f \gg 0} \ \sup_{\nu \in \mathcal{P}(S)} \int \nu(dx) \left( \frac{\sup_u \int p(dy|x,u)\, e^{r(x,u,y)} f(y)}{f(x)} \right) \\
&= \inf_{f \gg 0} \ \sup_{x} \left( \frac{\sup_u \int p(dy|x,u)\, e^{r(x,u,y)} f(y)}{f(x)} \right) \\
&= \inf_{f \gg 0} \ \sup_{x} \sup_{u} \int p(dy|x,u)\, e^{r(x,u,y) + \log f(y) - \log f(x)} \\
&= \inf_{f \gg 0} \ \sup_{\gamma \in \mathcal{P}(S \times U)} \int\int \gamma(dx,du) \int p(dy|x,u)\, e^{r(x,u,y) + \log f(y) - \log f(x)} .
\end{aligned}$$

Introduce the notation

$$\eta(dx,du,dy) = \eta_0(dx)\,\eta_1(du|x)\,\eta_2(dy|x,u) = \hat\eta(dx,du)\,\eta_2(dy|x,u).$$

Let

$$\mathcal{G} := \Big\{ \eta(dx,du,dy) : \eta_0 \text{ is invariant under the transition kernel } \int_U \eta_2(dy|x,u)\,\eta_1(du|x) \Big\},$$

i.e. $\eta \in \mathcal{G}$ iff

$$\int\int \hat\eta(dx,du)\,\eta_2(dy|x,u) = \eta_0(dy) .$$


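Recall that $D(\cdot\|\cdot)$ is convex and lower semi-continuous in both arguments [11]. Then

$$\begin{aligned}
\log\rho &= \inf_{f \gg 0} \sup_{\gamma} \log \int\int\int \gamma(dx,du)\, p(dy|x,u)\, e^{r(x,u,y) + \log f(y) - \log f(x)} \\
&= \inf_{g \in C(S)} \sup_{\gamma} \log \int\int\int \gamma(dx,du)\, p(dy|x,u)\, e^{r(x,u,y) + g(y) - g(x)} \\
&= \inf_{g \in C(S)} \sup_{\gamma} \sup_{\eta} \Big( \int\int\int \eta(dx,du,dy) \big( r(x,u,y) + g(y) - g(x) \big) - D\big(\eta(dx,du,dy) \,\|\, \gamma(dx,du)\, p(dy|x,u)\big) \Big) \\
&\qquad \text{(by the Gibbs variational formula (Prop. 1.4.2(a), pp. 33-34, [1?]))} \\
&= \sup_{\gamma} \sup_{\eta} \inf_{g \in C(S)} \Big( \int\int\int \eta(dx,du,dy) \big( r(x,u,y) + g(y) - g(x) \big) - D\big(\eta(dx,du,dy) \,\|\, \gamma(dx,du)\, p(dy|x,u)\big) \Big) \\
&\qquad \text{(by the min-max theorem [19])} \\
&= \sup_{\gamma} \sup_{\eta} \inf_{g \in C(S)} \Big( \int\int\int \eta(dx,du,dy) \big( r(x,u,y) + g(y) - g(x) \big) \\
&\qquad\qquad - \Big( D\big(\hat\eta(dx,du) \,\|\, \gamma(dx,du)\big) + \int\int \hat\eta(dx,du)\, D\big(\eta_2(dy|x,u) \,\|\, p(dy|x,u)\big) \Big) \Big) \\
&= \sup_{\eta} \inf_{g \in C(S)} \Big( \int\int\int \eta(dx,du,dy) \big( r(x,u,y) + g(y) - g(x) \big) - \int\int \hat\eta(dx,du)\, D\big(\eta_2(dy|x,u) \,\|\, p(dy|x,u)\big) \Big) \\
&\qquad \text{(by setting } \gamma = \hat\eta \text{)} \\
&= \sup_{\eta \in \mathcal{G}} \inf_{g \in C(S)} \Big( \int\int\int \eta(dx,du,dy) \big( r(x,u,y) + g(y) - g(x) \big) - \int\int \hat\eta(dx,du)\, D\big(\eta_2(dy|x,u) \,\|\, p(dy|x,u)\big) \Big) \\
&\qquad \text{(because } \inf_{g \in C(S)} \cdots = -\infty \ \forall \eta \notin \mathcal{G} \text{)} \\
&= \sup_{\eta \in \mathcal{G}} \Big( \int\int\int \eta(dx,du,dy)\, r(x,u,y) - \int\int \hat\eta(dx,du)\, D\big(\eta_2(dy|x,u) \,\|\, p(dy|x,u)\big) \Big) \\
&\qquad \text{(because } \eta \in \mathcal{G} \Longrightarrow \int\int\int \eta(dx,du,dy) \big( g(y) - g(x) \big) = 0 \text{)} .
\end{aligned}$$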
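The Gibbs variational formula invoked in this derivation states that the logarithm of an exponential moment equals the maximum of expected value minus relative entropy, with the maximum attained at the exponentially tilted measure. A minimal numerical sketch on a finite set follows. The reference measure `q`, the function `h`, and the helper names are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
q = rng.random(5)
q /= q.sum()               # reference probability vector (hypothetical)
h = rng.normal(size=5)     # bounded function on the finite set (hypothetical)

def rel_entropy(mu, ref):
    """D(mu || ref) for probability vectors with ref > 0 (0 log 0 := 0)."""
    mask = mu > 0
    return float(np.sum(mu[mask] * np.log(mu[mask] / ref[mask])))

def gibbs_objective(mu):
    """Expected value of h under mu minus D(mu || q)."""
    return float(mu @ h) - rel_entropy(mu, q)

log_moment = float(np.log(q @ np.exp(h)))

# Exponentially tilted measure, the claimed maximizer.
mu_star = q * np.exp(h)
mu_star /= mu_star.sum()

def random_probability_vector():
    mu = rng.random(5)
    return mu / mu.sum()

trial_values = [gibbs_objective(random_probability_vector()) for _ in range(100)]
print(log_moment, gibbs_objective(mu_star), max(trial_values))
```

In the derivation the same identity is applied with the kernel-weighted measure in the role of `q` and the reward plus the difference of g in the role of `h`.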










Thus we have:

Theorem 3. Under the assumptions (A0+) and (A1+), the optimal growth rate of reward $\lambda$, as defined in (2), has the variational characterization

$$\lambda = \log\rho = \sup_{\eta \in \mathcal{G}} \Big( \int\int\int \eta(dx,du,dy)\, r(x,u,y) - \int\int \hat\eta(dx,du)\, D\big(\eta_2(dy|x,u) \,\|\, p(dy|x,u)\big) \Big), \tag{9}$$

where $\rho$ is defined as in Theorem 1. □
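On a finite model the variational characterization can be tested directly. A natural candidate maximizer, suggested by the eigenvalue equation, uses the policy attaining the maximum there together with the transition kernel twisted by the eigenfunction; an elementary computation with the invariance condition shows that its objective value equals log ρ. The sketch below evaluates the objective at this candidate and at random feasible points. It is a sketch under assumptions that are not in the paper: finite sets, random strictly positive data, power iterations for the eigenpair and for stationary laws, and hypothetical helper names.

```python
import numpy as np

rng = np.random.default_rng(5)
n_states, n_actions = 4, 3

# Hypothetical model with strictly positive kernel and finite rewards.
p = rng.random((n_states, n_actions, n_states)) + 0.1
p /= p.sum(axis=2, keepdims=True)
r = rng.normal(size=(n_states, n_actions, n_states))
K = p * np.exp(r)

def T(f):
    return (K @ f).max(axis=1)

psi = np.ones(n_states)
for _ in range(2000):
    psi = T(psi)
    psi /= psi.max()
log_rho = float(np.log((T(psi) / psi).mean()))

def stationary(P, iters=5000):
    """Stationary law of a strictly positive stochastic matrix."""
    pi = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(iters):
        pi = pi @ P
    return pi

def objective(eta1, eta2):
    """Expected reward minus averaged relative entropy, with eta0 chosen
    as the stationary law of x -> sum_u eta1(u|x) eta2(.|x,u), so that
    eta0 * eta1 * eta2 satisfies the invariance constraint."""
    eta0 = stationary(np.einsum('xu,xuy->xy', eta1, eta2))
    eta_hat = eta0[:, None] * eta1
    reward = np.einsum('xu,xuy,xuy->', eta_hat, eta2, r)
    rel_ent = np.sum(eta2 * np.log(eta2 / p), axis=2)
    return float(reward - np.sum(eta_hat * rel_ent))

# Candidate maximizer: greedy policy and psi-twisted kernel.
greedy = (K @ psi).argmax(axis=1)
eta1_star = np.eye(n_actions)[greedy]
eta2_star = K * psi / (K @ psi)[:, :, None]
value_star = objective(eta1_star, eta2_star)

def random_feasible_value():
    e1 = rng.random((n_states, n_actions))
    e1 /= e1.sum(axis=1, keepdims=True)
    e2 = rng.random((n_states, n_actions, n_states)) + 0.05
    e2 /= e2.sum(axis=2, keepdims=True)
    return objective(e1, e2)

random_values = [random_feasible_value() for _ in range(20)]
print(log_rho, value_star, max(random_values))
```

The twisted kernel trades reward against relative entropy so that the two contributions sum to the same constant from every state.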

The following result, which uses a limiting argument to strengthen Theorem 3, is the main result of this paper.

Theorem 4. Under the assumptions (A0) and (A1), the optimal growth rate of reward $\lambda$, as defined in (2), has the variational characterization

$$\lambda = \sup_{\eta \in \mathcal{G}} \Big( \int\int\int \eta(dx,du,dy)\, r(x,u,y) - \int\int \hat\eta(dx,du)\, D\big(\eta_2(dy|x,u) \,\|\, p(dy|x,u)\big) \Big). \tag{10}$$

□


Before proving Theorem 4 let us first consider the uncontrolled case. We can fit this into our framework by taking $U$ to be a set with one point, so that $p(dy|x,u) = p(dy|x)$ for all $u \in U$, for some kernel $p(dy|x)$, and $r(x,u,y) = \tilde r(x,y)$ for all $u \in U$, for some $\tilde r(\cdot,\cdot)$. Theorem 4 then specializes to the statement that the growth rate of the reward, under the respective specializations of conditions (A0) and (A1), is given by

$$\lambda = \sup_{\alpha \in \mathcal{G}} \Big( \int\int \alpha(dx,dy)\, \tilde r(x,y) - \int \alpha_0(dx)\, D\big(\alpha_1(dy|x) \,\|\, p(dy|x)\big) \Big),$$

where $\alpha(dx,dy) = \alpha_0(dx)\,\alpha_1(dy|x)$ and

$$\mathcal{G} = \Big\{ \alpha(dx,dy) = \alpha_0(dx)\,\alpha_1(dy|x) : \int \alpha_0(dx)\,\alpha_1(dy|x) = \alpha_0(dy) \Big\}.$$

This is then a version of the Donsker-Varadhan formula for the Perron-Frobenius eigenvalue of a positive operator [15], [18], [22].
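In the uncontrolled finite case the eigenvalue is the ordinary Perron-Frobenius eigenvalue of the matrix with entries p(y|x) e^{r(x,y)}, so the formula can be compared against a standard eigenvalue routine. The sketch below does this for a randomly generated strictly positive chain, which is an assumption made for illustration, as are the helper names. It also evaluates the objective at the untwisted choice, where the relative entropy term vanishes and only the stationary mean reward remains, which by the formula cannot exceed the growth rate.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 5

# Hypothetical uncontrolled chain: P[x, y] = p(y|x) > 0, reward r[x, y].
P = rng.random((n, n)) + 0.1
P /= P.sum(axis=1, keepdims=True)
r = rng.normal(size=(n, n))
A = P * np.exp(r)  # positive matrix whose Perron eigenvalue is rho

eigvals, eigvecs = np.linalg.eig(A)
k = int(np.argmax(eigvals.real))          # Perron root has the largest real part
rho = float(eigvals[k].real)
psi = np.abs(eigvecs[:, k].real)          # Perron vector, made positive

def stationary(Q, iters=5000):
    """Stationary law of a strictly positive stochastic matrix."""
    pi = np.full(n, 1.0 / n)
    for _ in range(iters):
        pi = pi @ Q
    return pi

def dv_objective(alpha1):
    """Mean reward minus averaged relative entropy for the kernel alpha1,
    with alpha0 taken as its stationary law."""
    alpha0 = stationary(alpha1)
    return float(np.sum(alpha0[:, None] * alpha1 * (r - np.log(alpha1 / P))))

# Twisted kernel alpha1(y|x) = A[x, y] psi(y) / (rho psi(x)).
alpha1_star = A * psi[None, :] / (rho * psi[:, None])
value_star = dv_objective(alpha1_star)

# Untwisted kernel: entropy term is zero, objective is the mean reward.
value_untwisted = dv_objective(P)
print(np.log(rho), value_star, value_untwisted)
```

The gap between the last two printed values quantifies how much the growth rate of the multiplicative reward exceeds the ordinary long-run average reward of the chain.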


Proof of Theorem 4. Let $\gamma(dy)$ be an arbitrary probability distribution on $S$ with full support, and, for all $\epsilon > 0$ sufficiently small, define the kernel

$$p_\epsilon(dy|x,u) := \frac{1}{\alpha(x,u) + \epsilon} \Big( e^{r(x,u,y)}\, p(dy|x,u) + \epsilon\gamma(dy) \Big),$$

and the reward

$$r_\epsilon(x,u,y) := \log(\alpha(x,u) + \epsilon),$$

where

$$\alpha(x,u) := \int e^{r(x,u,y)}\, p(dy|x,u) .$$

Since this kernel and reward satisfy the conditions (A0+) and (A1+), we have from Theorem 3 that the optimal growth rate of reward for the risk-sensitive reward maximization problem for this kernel and reward, call it $\lambda_\epsilon$, is given by

$$\lambda_\epsilon = \sup_{\eta \in \mathcal{G}} \Big( \int\int\int \eta(dx,du,dy)\, r_\epsilon(x,u,y) - \int\int \hat\eta(dx,du)\, D\big(\eta_2(dy|x,u) \,\|\, p_\epsilon(dy|x,u)\big) \Big). \tag{11}$$


From the formulation of the risk-sensitive objective we see that $\lambda_\epsilon$ is nondecreasing in $\epsilon$, and that $\lambda_\epsilon \ge \lambda$ for all $\epsilon > 0$, where $\lambda$ is defined as in (2). This can be seen by writing the expression for the $n$-step multiplicative reward, i.e.

$$E\left[ e^{\sum_{m=0}^{n-1} r_\epsilon(X_m, Z_m, X_{m+1})} \,\Big|\, X_0 = x \right],$$

as a multiple integral, which reveals that this quantity is monotonically nondecreasing in $\epsilon$ for any initial condition $x \in S$ and any admissible control strategy. Thus $\lim_{\epsilon \to 0} \lambda_\epsilon$ exists and satisfies

$$\lim_{\epsilon \to 0} \lambda_\epsilon \ge \lambda . \tag{12}$$
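The effect of the perturbation is easiest to see in a finite uncontrolled example in which some entries of e^{r} p vanish, so that (A0+) fails for the original problem. The product of the perturbed reward factor and the perturbed kernel is exactly the original weight plus ε times γ, which makes the monotonicity in ε transparent. The sketch below is an illustration only: the three-state chain, the uniform choice of γ, and the helper names are assumptions, and the chain is chosen irreducible so that the growth rate is the logarithm of the spectral radius.

```python
import numpy as np

# Hypothetical 3-state uncontrolled chain; zeros in p and in exp(r)
# (the latter encoding r = -infinity) violate (A0+) and (A1+).
p = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])
exp_r = np.array([[1.0, 2.0, 1.0],
                  [1.0, 0.0, 3.0],
                  [0.5, 1.0, 1.0]])
A = exp_r * p                 # A[x, y] = exp(r(x,y)) p(y|x)
gamma = np.full(3, 1.0 / 3)   # full-support probability vector

def perturbed(eps):
    """Return (p_eps, r_eps) following the construction in the proof."""
    alpha = A.sum(axis=1)                              # alpha(x)
    p_eps = (A + eps * gamma) / (alpha + eps)[:, None]
    r_eps = np.log(alpha + eps)                        # depends on x only
    return p_eps, r_eps

def growth_rate(M):
    """log spectral radius of a nonnegative matrix."""
    return float(np.log(np.max(np.abs(np.linalg.eigvals(M)))))

def lam_eps(eps):
    p_eps, r_eps = perturbed(eps)
    return growth_rate(np.exp(r_eps)[:, None] * p_eps)

lam = growth_rate(A)
eps_grid = [1e-1, 1e-2, 1e-3, 1e-4]
lams = [lam_eps(e) for e in eps_grid]
print(lam, lams)
```

As ε decreases the perturbed growth rates decrease towards the unperturbed one, in line with (12) and with the equality that the proof goes on to establish.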

To prove (10), we will first prove that

$$\lim_{\epsilon \to 0} \lambda_\epsilon \le \sup_{\eta \in \mathcal{G}} \Big( \int\int\int \eta(dx,du,dy)\, r(x,u,y) - \int\int \hat\eta(dx,du)\, D\big(\eta_2(dy|x,u) \,\|\, p(dy|x,u)\big) \Big), \tag{13}$$

and then prove that

$$\lambda \ge \sup_{\eta \in \mathcal{G}} \Big( \int\int\int \eta(dx,du,dy)\, r(x,u,y) - \int\int \hat\eta(dx,du)\, D\big(\eta_2(dy|x,u) \,\|\, p(dy|x,u)\big) \Big). \tag{14}$$

Together with (12), these two claims establish (10).

For fixed $\eta \in \mathcal{G}$, let $\Psi_\epsilon(\eta)$ denote the expression inside the outer brackets on the right hand side of (11). Then one has

$$\Psi_\epsilon(\eta) = - \int\int \hat\eta(dx,du)\, D\big(\eta_2(dy|x,u) \,\|\, e^{r(x,u,y)}\, p(dy|x,u) + \epsilon\gamma(dy)\big) . \tag{15}$$

Similarly, for fixed $\eta \in \mathcal{G}$, let $\Psi_0(\eta)$ denote the expression inside the outer brackets on the right hand side of (10). We have

$$\Psi_0(\eta) = - \int\int \hat\eta(dx,du)\, D\big(\eta_2(dy|x,u) \,\|\, e^{r(x,u,y)}\, p(dy|x,u)\big) . \tag{16}$$

In fact, (15) reveals that for each $\eta \in \mathcal{G}$, $\Psi_\epsilon(\eta)$ is nondecreasing in $\epsilon$, and together with (16), reveals that for all $\epsilon > 0$ and $\eta \in \mathcal{G}$, we have $\Psi_\epsilon(\eta) \ge \Psi_0(\eta)$. Thus we may conclude that for each $\eta \in \mathcal{G}$ the limit $\lim_{\epsilon \to 0} \Psi_\epsilon(\eta)$ exists, and that this limit satisfies $\lim_{\epsilon \to 0} \Psi_\epsilon(\eta) \ge \Psi_0(\eta)$.

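Now, for all $\epsilon > 0$ and $\delta > 0$ sufficiently small, choose $\eta_\epsilon^\delta \in \mathcal{G}$ such that $\Psi_\epsilon(\eta_\epsilon^\delta) \ge \lambda_\epsilon - \delta$. Since $\mathcal{G}$ is compact, there is a decreasing sequence $(\epsilon_m, m \ge 1)$ with $\lim_{m \to \infty} \epsilon_m = 0$, such that the sequence $(\eta_{\epsilon_m}^\delta, m \ge 1)$ has a limit in $\mathcal{P}(S \times U \times S)$, call it $\eta^\delta$. Further, since $\mathcal{G}$ is closed, we have $\eta^\delta \in \mathcal{G}$. By the lower semicontinuity of $D(\cdot\|\cdot)$ as a function of $(\cdot,\cdot)$ [11] we have

$$\sup_{\eta \in \mathcal{G}} \Psi_0(\eta) \ge \Psi_0(\eta^\delta) \ge \lim_{m \to \infty} \Psi_{\epsilon_m}(\eta_{\epsilon_m}^\delta) \ge \lim_{m \to \infty} \lambda_{\epsilon_m} - \delta = \lim_{\epsilon \to 0} \lambda_\epsilon - \delta .$$

Since $\delta > 0$ was arbitrary, this establishes (13).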

It remains to prove (14). If $\sup_{\eta \in \mathcal{G}} \Psi_0(\eta)$ (i.e. the right hand side of (14)) equals $-\infty$ then there is nothing to prove, so we may assume that this is not the case. Given $\eta \in \mathcal{G}$ for which $\Psi_0(\eta) \ne -\infty$, consider implementing the stationary Markov strategy defined by the kernel $\eta_1(du|x)$. The expected multiplicative reward after $n$ steps when implementing this strategy, conditioned on starting with the initial distribution $\eta_0(dx_0)$, is

$$\int \cdots \int \eta_0(dx_0) \prod_{m=0}^{n-1} \eta_1(du_m|x_m)\, p(dx_{m+1}|x_m,u_m)\, e^{r(x_m,u_m,x_{m+1})} .$$

Since $\eta_2(dy|x,u)$ is absolutely continuous with respect to $p(dy|x,u)$ for almost all $(x,u)$, this equals


n —1 


i r]2( dx m+l\ x m,um) 

m=0 


l \xm,um ) 


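Let $\{X_n'\}$ denote a controlled Markov chain with controlled transition kernel $\eta_2(dy|x,u)$, initial law $\eta_0$, and controlled by $\eta_1(du|x) \in$ RM, with $\{Z_n'\}$ the corresponding control sequence. Then

$$\begin{aligned}
\lambda &\ge \lim_{n \to \infty} \frac{1}{n} \log \int \cdots \int \eta_0(dx_0) \prod_{m=0}^{n-1} \eta_1(du_m|x_m)\, p(dx_{m+1}|x_m,u_m)\, e^{r(x_m,u_m,x_{m+1})} \\
&\ge \lim_{n \to \infty} \frac{1}{n} \log \int \cdots \int \eta_0(dx_0) \prod_{m=0}^{n-1} \eta_1(du_m|x_m)\, \eta_2(dx_{m+1}|x_m,u_m) \\
&\qquad\qquad \times \, e^{\sum_{m=0}^{n-1} \left( r(x_m,u_m,x_{m+1}) - \log \frac{\eta_2(dx_{m+1}|x_m,u_m)}{p(dx_{m+1}|x_m,u_m)} \right)} \\
&= \lim_{n \to \infty} \frac{1}{n} \log E\left[ e^{\sum_{m=0}^{n-1} \left( r(X_m',Z_m',X_{m+1}') - \log \frac{\eta_2(dX_{m+1}'|X_m',Z_m')}{p(dX_{m+1}'|X_m',Z_m')} \right)} \right] \\
&\ge \lim_{n \to \infty} \frac{1}{n} E\left[ \sum_{m=0}^{n-1} \left( r(X_m',Z_m',X_{m+1}') - \log \frac{\eta_2(dX_{m+1}'|X_m',Z_m')}{p(dX_{m+1}'|X_m',Z_m')} \right) \right] \quad \text{(by Jensen's inequality)} \\
&= \Psi_0(\eta) \quad \text{(because } \eta \in \mathcal{G} \text{)} .
\end{aligned}$$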


It follows that $\lambda$, as defined in (2), satisfies (14), which concludes the proof of Theorem 4. □


4 Remarks 

1. Assume (A0), (A1). Fix $\varphi \in$ RM, and consider $\{(X_n, Z_n), n \ge 0\}$ governed by the randomized stationary Markov strategy $\varphi$ as an uncontrolled $S \times U$-valued Markov chain. To be precise, let $\bar S$ denote $S \times U$, let $\bar U := \{\bar u\}$ be a one point set, and define $\bar p(d\bar y|\bar x,\bar u) : \bar S \times \bar U \mapsto \mathcal{P}(\bar S)$ by

$$\bar p(d\bar y|\bar x,\bar u) := p(dy|x,u)\,\varphi(du'|y),$$

where $\bar x := (x,u)$ and $\bar y := (y,u')$. Also, let

$$\bar r(\bar x,\bar u,\bar y) := r(x,u,y) .$$

It is straightforward to check that the assumptions (A0), (A1) hold for the $\bar S$-valued chain with trivial control space $\bar U$ and with the transition kernel and one step reward as above.

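Given $\tau(dx,du,dy,du') = \tau_0(dx)\,\tau_1(du|x)\,\tau_2(dy|x,u)\,\tau_3(du'|x,u,y)$, write $\hat\tau(dx,du)$ for $\tau_0(dx)\,\tau_1(du|x)$ and $\tilde\tau(dy,du'|x,u)$ for $\tau_2(dy|x,u)\,\tau_3(du'|x,u,y)$. Let

$$\mathcal{G}^+ := \Big\{ \tau(dx,du,dy,du') : \int\int \hat\tau(dx,du)\,\tilde\tau(dy,du'|x,u) = \hat\tau(dy,du') \Big\} .$$

Further, given $\tau(dx,du,dy,du')$, we define $\tau'(dx,du,dy,du')$ by setting

$$\tau_0' := \tau_0, \quad \tau_1' := \tau_1, \quad \tau_2' := \tau_2, \quad \tau_3'(du'|x,u,y) := \tau_1(du'|y),$$

with the corresponding definitions for $\hat\tau'$, $\tilde\tau'$. We claim that $\tau' \in \mathcal{G}^+$. To see this, first observe that $\int\int \hat\tau(dx,du)\,\tilde\tau(dy,du'|x,u) = \hat\tau(dy,du')$ when integrated over $u'$ gives $\int\int \hat\tau(dx,du)\,\tau_2(dy|x,u) = \tau_0(dy)$. This means

$$\begin{aligned}
\int\int \hat\tau'(dx,du)\,\tilde\tau'(dy,du'|x,u) &= \int\int \hat\tau(dx,du)\,\tau_2(dy|x,u)\,\tau_1(du'|y) \\
&= \tau_0(dy)\,\tau_1(du'|y) \\
&= \hat\tau(dy,du') = \hat\tau'(dy,du'),
\end{aligned}$$

which establishes the claim.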

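Let $\lambda_\varphi$ denote the asymptotic growth rate of reward under the fixed randomized stationary Markov strategy $\varphi$. Then by applying Theorem 4 to the $\bar S$-valued chain with trivial control space $\bar U$ defined above, we have

$$\lambda_\varphi = \sup_{\tau \in \mathcal{G}^+} \Big( \int\int\int \tau(dx,du,dy,U)\, r(x,u,y) - \int\int \hat\tau(dx,du)\, D\big(\tilde\tau(dy,du'|x,u) \,\|\, p(dy|x,u)\,\varphi(du'|y)\big) \Big). \tag{17}$$

Then we have

$$\begin{aligned}
\sup_{\varphi \in \mathrm{RM}} \lambda_\varphi &= \sup_{\varphi \in \mathrm{RM}} \sup_{\tau \in \mathcal{G}^+} \Big( \int\int\int \tau(dx,du,dy,U)\, r(x,u,y) \\
&\qquad\qquad - \int\int \hat\tau(dx,du)\, D\big(\tau_2(dy|x,u)\,\tau_3(du'|x,u,y) \,\|\, p(dy|x,u)\,\varphi(du'|y)\big) \Big) \\
&\overset{(a)}{=} \sup_{\tau \in \mathcal{G}^+} \Big( \int\int\int \tau'(dx,du,dy,U)\, r(x,u,y) - \int\int \hat\tau'(dx,du)\, D\big(\tau_2'(dy|x,u) \,\|\, p(dy|x,u)\big) \Big) \\
&\overset{(b)}{=} \sup_{\eta \in \mathcal{G}} \Big( \int\int\int \eta(dx,du,dy)\, r(x,u,y) - \int\int \hat\eta(dx,du)\, D\big(\eta_2(dy|x,u) \,\|\, p(dy|x,u)\big) \Big) \\
&= \lambda .
\end{aligned}$$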

Here, to justify step (a), notice that for every $\tau \in \mathcal{G}^+$, we have shown that $\tau' \in \mathcal{G}^+$. Therefore we have both

$$\int\int\int \tau'(dx,du,dy,U)\, r(x,u,y) = \int\int\int \tau(dx,du,dy,U)\, r(x,u,y)$$

and

$$\int\int \hat\tau'(dx,du)\, D\big(\tau_2'(dy|x,u) \,\|\, p(dy|x,u)\big) = \int\int \hat\tau(dx,du)\, D\big(\tau_2(dy|x,u) \,\|\, p(dy|x,u)\big) .$$

The choice of $\varphi(du'|y) = \tau_1(du'|y)$ (which also equals $\tau_3'(du'|x,u,y)$) would make the expression

$$\int\int\int \hat\tau'(dx,du)\,\tau_2'(dy|x,u)\, D\big(\tau_3'(du'|x,u,y) \,\|\, \varphi(du'|y)\big)$$

equal to zero, whereas the expression

$$\int\int\int \hat\tau(dx,du)\,\tau_2(dy|x,u)\, D\big(\tau_3(du'|x,u,y) \,\|\, \varphi(du'|y)\big)$$

is nonnegative. To justify step (b) note that for every $\tau \in \mathcal{G}^+$, we have $\tau_0(dx)\,\tau_1(du|x)\,\tau_2(dy|x,u) \in \mathcal{G}$, and conversely for every $\eta \in \mathcal{G}$ we get $\tau \in \mathcal{G}^+$ by defining $\tau(dx,du,dy,du') := \eta(dx,du,dy)\,\eta_1(du'|y)$. Furthermore, this $\tau$ satisfies $\tau' = \tau$.

The upshot is that we have proved 

A = sup Xtj, . (18) 

ipGRM 

Under (A0+), (A1+), this supremum is in fact a maximum by virtue 
of Theorem 2. 


2. Since $D(\cdot\|\cdot)$ is convex and lower semi-continuous in its arguments as noted earlier, (10) is a concave maximization problem on the convex$^{3}$ set

$$\mathcal{G}_1 := \Big\{ \eta(dx)\varphi(du|x)\mu(dy|x, u) : \eta \text{ is invariant under the transition kernel } x \mapsto \int_U \varphi(du|x)\mu(dy|x, u) \Big\}.$$

It is worthwhile to compare this formulation with the classical dynamic programming approach. Recall that the dynamic programming equation (6) is the nonlinear eigenvalue problem

$$\rho V(x) = \sup_{\varphi} \int\!\!\int p(dy|x, u)\,\varphi(du|x)\, e^{r(x, u, y)}\, V(y).$$

Consider the standard 'log transformation' $\zeta(x) := \log V(x)$. Then

$$\log \rho + \zeta(x) = \sup_{\varphi} \log \Big( \int\!\!\int p(dy|x, u)\,\varphi(du|x)\, e^{r(x, u, y) + \zeta(y)} \Big).$$

We treat $x$ as a fixed parameter on the right hand side. By the Gibbs variational principle, we have

$$\log \rho + \zeta(x) = \sup_{\varphi}\, \sup_{\mu(\cdot, \cdot|x) \in \mathcal{P}(U \times S)} \Big( \int\!\!\int \mu(du, dy|x)\big(r(x, u, y) + \zeta(y)\big) - D\big(\mu(du, dy|x) \,\big\|\, p(dy|x, u)\varphi(du|x)\big) \Big). \tag{19}$$

3 See m section 11.2.3, p. 358] for the proof of convexity 
Equation (19) is the dynamic programming equation for an ergodic team problem whose 'per stage payoff' function is

$$r(x, u, y) - D\big(\mu(du, dy|x) \,\big\|\, p(dy|x, u)\varphi(du|x)\big),$$

where $\mu$ specifies an additional control variable the choice of which is in fact the distribution of the next state and control, whereas the original randomized control $\varphi$ affects only the payoff. This is a team problem as opposed to a control problem because while both controls have the same objective, viz., to maximize a common reward, they are implemented in a non-cooperative manner. This is reminiscent of, e.g., [23], which considers the cost minimization formulation in which a similar procedure leads to a zero sum ergodic game. There does not, however, appear to be any corresponding earlier development for the reward maximization problem with a positive reward. While this is completely analogous to the game situation, we have obtained it without an explicit minorization condition as in [16], or the 'condition B' of [26]. We have instead conditions (A0) and (A1), which are relatively mild, and compactness of the state space, which is not. We are working towards relaxing the latter.

An important point to note here is that we have an equivalent problem of maximizing a concave upper semi-continuous function over the convex set $\mathcal{G}_1$. This is in contrast with the ergodic team problem of maximizing the same function over the nonconvex set

$$\mathcal{G}_2 := \big\{ \eta(dx)\varphi(du|x)\mu'(dy|x) : \eta \text{ is invariant under the transition kernel } x \mapsto \mu'(dy|x) \big\},$$

i.e., where the controls $\varphi$, $\mu'$ are chosen by the two team members non-cooperatively. The latter is what one obtains from the team formulation via log transformation.

3. It is worth noting that the entropic penalty implicit in our variational formula also arises in different contexts [8], [23], [40].
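
The Gibbs variational principle used in Remark 2 can be checked numerically on a finite set. The following is a minimal sketch, not part of the original argument: the reference distribution `p`, the vector `f` (a stand-in for $r + \zeta$ at a fixed $x$) and all function names are illustrative assumptions. It checks that $\log \sum_i p_i e^{f_i} = \max_\mu \big( \sum_i \mu_i f_i - D(\mu \| p) \big)$, with the maximum attained at the tilted distribution $\mu^*_i \propto p_i e^{f_i}$.

```python
import numpy as np

def relative_entropy(mu, p):
    """D(mu || p) for finite distributions, with the convention 0 log 0 = 0."""
    mask = mu > 0
    return float(np.sum(mu[mask] * np.log(mu[mask] / p[mask])))

def gibbs_objective(mu, f, p):
    """The quantity maximized in the Gibbs principle: <mu, f> - D(mu || p)."""
    return float(mu @ f) - relative_entropy(mu, p)

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(5))   # hypothetical reference distribution
f = rng.normal(size=5)          # hypothetical stand-in for r + zeta

# Left side of the Gibbs principle.
log_partition = float(np.log(np.sum(p * np.exp(f))))

# Tilted distribution, which should attain the maximum on the right side.
mu_star = p * np.exp(f) / np.sum(p * np.exp(f))
value_at_mu_star = gibbs_objective(mu_star, f, p)

# Randomly sampled distributions should never exceed the value at mu_star.
best_random = max(
    gibbs_objective(rng.dirichlet(np.ones(5)), f, p) for _ in range(1000)
)
```

In (19) the same identity is applied for each fixed $x$, with `p` replaced by $p(dy|x,u)\varphi(du|x)$ and `f` by $r(x,u,y) + \zeta(y)$.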
5 Examples

5.1 Path counting on graphs 

Let $G$ be a directed graph on a finite vertex set $S$ of size $d$, with edge set denoted by $\mathcal{E}_G$. Let $M_G$ denote the adjacency matrix of the graph, namely the $d \times d$ nonnegative matrix $M_G = [m(x, y)]$, with $m(x, y) = 1$ if $(x, y) \in \mathcal{E}_G$, and $m(x, y) = 0$ otherwise. Assume that each vertex has at least one out-going edge. For $n \ge 1$ and $x \in S$, let $N_n(x)$ denote the number of directed paths of length $n$ starting at $x$. Then the growth rate of the number of directed paths in the graph, namely

$$\max_{x \in S}\, \lim_{n \to \infty} \frac{1}{n} \log N_n(x),$$

exists and equals $\log \rho(M_G)$, where $\rho(M_G)$ is the Perron-Frobenius eigenvalue of $M_G$.

It is also known that this limit can be written as

$$\sup_{G\text{-compatible } (\Pi, \pi)} \Big( - \sum_{x} \pi(x) \sum_{y} \Pi(y|x) \log \Pi(y|x) \Big). \tag{20}$$

Here $\Pi$ ranges over $d \times d$ transition probability matrices that are $G$-compatible for the directed graph $G$, i.e. such that $\Pi(y|x) > 0$ implies that $(x, y) \in \mathcal{E}_G$, and $\pi$ ranges over invariant probability distributions for $\Pi$. Note that this is the largest entropy rate among all stationary Markov chains whose transition probability matrix is compatible with the graph.

This characterization of the growth rate of the number of paths in an irreducible graph is a consequence of the Donsker-Varadhan formula for the Perron-Frobenius eigenvalue of a nonnegative matrix. Let us verify this as a corollary of Theorem 4 in the case without controls. We take the state space in Theorem 4 to be $S$, i.e. the vertex set of the graph. The control space $U$ is a set consisting of a single point, which we write as $U = \{u\}$. Let $p(y|x, u) := \frac{1}{d(x)}$ for $(x, y) \in \mathcal{E}_G$, where $d(x)$ is the out-degree of $x$, and let

$$r(x, u, y) := \begin{cases} \log d(x) & \text{if } (x, y) \in \mathcal{E}_G, \\ -\infty & \text{otherwise.} \end{cases} \tag{21}$$

Substituting these into the right hand side of (10) gives the expression in (20).
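
The equality between $\log \rho(M_G)$ and the maximal entropy rate in (20) can be illustrated numerically. The sketch below is an illustration under stated assumptions rather than part of the paper: the two-vertex example graph and the helper names are hypothetical, and the candidate maximizer $\Pi(y|x) = m(x,y)\,v(y) / (\rho\, v(x))$, built from a right Perron eigenvector $v$, is the standard maximal-entropy chain from symbolic dynamics, which the text does not state explicitly.

```python
import numpy as np

def perron(M):
    """Perron-Frobenius eigenvalue and a nonnegative eigenvector of an
    irreducible nonnegative matrix M (right eigenvector: M v = rho v)."""
    vals, vecs = np.linalg.eig(M)
    k = int(np.argmax(vals.real))
    return float(vals[k].real), np.abs(vecs[:, k].real)

def entropy_rate(Pi, pi):
    """-sum_x pi(x) sum_y Pi(y|x) log Pi(y|x), with 0 log 0 = 0."""
    logs = np.zeros_like(Pi)
    mask = Pi > 0
    logs[mask] = np.log(Pi[mask])
    return float(-np.sum(pi[:, None] * Pi * logs))

# Hypothetical example graph with edges 0->0, 0->1, 1->0.
M = np.array([[1.0, 1.0],
              [1.0, 0.0]])

rho, v = perron(M)       # right eigenvector
_, w = perron(M.T)       # left eigenvector of M

# Candidate maximizer of (20) and its invariant distribution.
Pi = M * v[None, :] / (rho * v[:, None])
pi = w * v / np.sum(w * v)

# Direct path counting: N_n(x) = (M^n 1)(x).
n = 30
counts = np.linalg.matrix_power(M, n) @ np.ones(2)
empirical_rate = float(np.log(counts.max()) / n)
```

For this graph `rho` is the golden ratio, `entropy_rate(Pi, pi)` agrees with `np.log(rho)` up to rounding, and `empirical_rate` approaches the same value as `n` grows.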


We now bring risk-sensitive control into this mix of ideas. Let $U$ be a finite set and suppose now that for each $u \in U$ we are given a directed graph $G_u$ with vertex set $S$. Assume that each vertex has at least one out-going edge in each $G_u$. We pose the problem of maximizing

$$\max_{x \in S}\, \liminf_{n \to \infty} \frac{1}{n} \log N_n(x),$$

where now $N_n(x)$ is the largest number of directed paths of length $n$ one can create when starting at $x$ and at each time choosing one of the graphs along which to move (i.e. one of the control actions) depending on the history of the states visited so far. More generally, we might allow for a randomized choice of the graph to be used at each time, based on the history of the states and the realizations of the control so far, and ask for the maximum growth rate of the expectation of the number of directed paths of each length that we can create in this way.

This problem can be posed in a framework that is amenable to an application of Theorem 4. As in the case without controls, we set $p(y|x, u) := \frac{1}{d_u(x)}$ for all $(x, y) \in \mathcal{E}_{G_u}$, where $d_u(x)$ denotes the out-degree of vertex $x$ in $G_u$, and we now set

$$r(x, u, y) := \begin{cases} \log d_u(x) & \text{if } (x, y) \in \mathcal{E}_{G_u}, \\ -\infty & \text{otherwise.} \end{cases}$$

According to Theorem 4 this maximum growth rate is given by

$$\max_{\eta} \Big( - \sum_{x, u} \hat\eta(x, u) \sum_{y : (x, y) \in \mathcal{E}_{G_u}} \eta_2(y|x, u) \log \eta_2(y|x, u) \Big), \tag{22}$$

where the maximum is over all $\eta(x, u, y) = \hat\eta(x, u)\,\eta_2(y|x, u)$ with $\eta_2(y|x, u) > 0$ implying that $(x, y) \in \mathcal{E}_{G_u}$, and such that

$$\sum_{(x, u)} \hat\eta(x, u)\,\eta_2(y|x, u) = \eta_0(y),$$

where, as usual, $\eta_0(x) := \sum_u \hat\eta(x, u)$. Note that this has the following interpretation: among all stationary Markov chains $((X_n, Z_n), n \ge 0)$ with state space $S \times U$ that are compatible with the family of graphs in the sense that if a transition from $(x, u)$ to $(y, u')$ has positive probability then $(x, y) \in \mathcal{E}_{G_u}$, maximize the conditional entropy of the next state given the current state-action pair, i.e. maximize $H(X_1 | X_0, Z_0)$.
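
For a small family of graphs the optimal growth rate can be computed directly and compared with the best stationary deterministic choice of graph. The following sketch is an illustration under assumptions that are not in the text: the two example graphs are hypothetical, and the computation uses the recursion $N_n(x) = \max_u \sum_{y : (x,y) \in \mathcal{E}_{G_u}} N_{n-1}(y)$ together with a nonlinear power iteration, in line with the characterization of the optimal multiplier as the Perron-Frobenius eigenvalue of a nonlinear operator. By the entropy-rate interpretation of (20), each fixed stationary choice $\sigma : S \to U$ achieves $\log \rho(M_\sigma)$, where row $x$ of $M_\sigma$ is row $x$ of the adjacency matrix of $G_{\sigma(x)}$.

```python
import itertools
import numpy as np

# Hypothetical family of two graphs on S = {0, 1, 2};
# A[u][x, y] = 1 iff (x, y) is an edge of G_u.
A = np.array([
    [[1, 1, 0], [0, 1, 1], [1, 0, 1]],   # G_0
    [[1, 1, 1], [1, 0, 0], [0, 1, 0]],   # G_1
], dtype=float)
num_actions, num_states, _ = A.shape

def bellman(f):
    """(T f)(x) = max_u sum_y m_u(x, y) f(y)."""
    return np.max(A @ f, axis=0)

# Nonlinear power iteration; min and max of (T f)/f bracket the eigenvalue.
f = np.ones(num_states)
for _ in range(200):
    g = bellman(f)
    lower, upper = float(np.min(g / f)), float(np.max(g / f))
    f = g / g.sum()
growth_rate = float(np.log(0.5 * (lower + upper)))

def spectral_radius(M):
    return float(np.max(np.abs(np.linalg.eigvals(M))))

# Brute force over stationary deterministic choices sigma: S -> U.
best = max(
    spectral_radius(np.array([A[sigma[x], x] for x in range(num_states)]))
    for sigma in itertools.product(range(num_actions), repeat=num_states)
)
```

In this example the two computations agree: `growth_rate` matches `np.log(best)`, so a stationary deterministic choice of graph already attains the optimal growth rate.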

The interpretation of the growth rate of the number of directed paths of a given length in a directed graph as an entropy rate has considerable practical importance in coding theory. Each directed path of length $n$ can be viewed as an allowed sequence of length $n$, with coordinates from the state space $S$, and the set of such directed paths is then viewed as a set of constrained sequences [?, Problem 4.16], [31]. The problem of constrained coding has been extensively studied. In one version of this problem, the goal is to come up with algorithms that can take an infinitely long sequence of symbols from a finite set of size $m$ and produce $S$-valued sequences as output in a one-to-one fashion, and such that the output sequences meet the constraints defined by the graph, see [31, Sec. 5.2] for more details. Naturally, it is not possible to do this if $\log m$ exceeds the growth rate given by (20); finding efficient algorithms to do this whenever $\log m$ is less than the growth rate given in (20) was a key early success in this area [?], [?]. Investigating the question of constrained coding up to the maximum possible conditional entropy rate given by the application of Theorem 4 to the controlled graph formulation above would be an interesting challenge.

5.2 Portfolio optimization 

As another example, we consider the portfolio optimization problem from [6], except that we consider the reward maximization framework instead of cost minimization as in the classic work of Cover [13]. The model is as follows. The underlying 'factor process' $\{X_n\}$ is a discrete time Markov chain on a finite state space $\mathcal{Q} := \{1, \dots, m\}$ (say) with an irreducible transition matrix $Q = [[q(j|i)]]$. The control space will be the simplex $A := \{a = (a_1, \dots, a_m) \in \mathcal{R}^m : a_i \ge 0\ \forall i,\ \sum_i a_i \le 1\}$, with $a_i$ denoting the proportion of wealth invested in the $i$th risky asset. In particular, $1 - \sum_i a_i$ is then the proportion invested in the risk-less bank account. We denote by $\{\pi_n\}$ the $A$-valued control sequence, representing the trading strategy, i.e., $\pi_n^{(i)}$ will be the proportion of wealth invested in the $i$th risky asset at time $n$. $\{W_n\}$ is the process of $m$-dimensional vectors of price relatives such that $W_{n+1}$ is conditionally independent of $X_i$, $i < n$, and $W_i, \pi_i$, $i \le n$, given $(X_n, X_{n+1})$, and its conditional law given the latter is specified by a kernel $\nu(x, y, dw) : \mathcal{Q} \times \mathcal{Q} \mapsto \mathcal{R}^m$ with support in the interior of the positive cone in $\mathcal{R}^m$. Let $e^r$, $r > 0$, denote the per period multiplier of wealth invested in the bank account (thus $e^r - 1$ is the interest rate). Let $\mathbf{1}$ denote the constant vector of all 1's. The evolution of the wealth process $\{V_n\}$ is given by

$$V_{n+1} = V_n \big[ e^r + \langle \pi_n, W_{n+1} - e^r \mathbf{1} \rangle \big],$$

where $V_0 := 1$. The objective is to maximize the risk-adjusted growth rate of wealth

$$\liminf_{n \uparrow \infty} \frac{1}{n} \log E\Big[ e^{-\frac{\theta}{2} \log V_n} \Big]. \tag{23}$$

Here $\theta$ is the risk sensitivity parameter. The control sequence $\{\pi_n\}$ is assumed to be adapted to the factor process $\{X_n\}$ and the controls, i.e. the distribution of $\pi_n$ is chosen as a function of $(X_0, \dots, X_n, \pi_0, \dots, \pi_{n-1})$.


It is useful to contrast the objective we consider with that considered in [6] of maximizing, for $\theta > 0$, the quantity

$$\liminf_{n \uparrow \infty} -\frac{2}{\theta} \frac{1}{n} \log E\Big[ e^{-\frac{\theta}{2} \log V_n} \Big]. \tag{24}$$

In [6], this problem is considered by writing the objective in (24) as

$$-\frac{2}{\theta} \limsup_{n \uparrow \infty} \frac{1}{n} \log E\Big[ e^{-\frac{\theta}{2} \log V_n} \Big],$$

and then studying the risk-sensitive cost minimization problem corresponding to the objective

$$\limsup_{n \uparrow \infty} \frac{1}{n} \log E\Big[ e^{-\frac{\theta}{2} \log V_n} \Big].$$

That positive $\theta$ indicates risk aversion in (24) is argued, see [7, Eqn. (2.1)], by writing the Taylor series expansion, for small $\theta$,

$$-\frac{2}{\theta} \log E\Big[ e^{-\frac{\theta}{2} \log V_n} \Big] = E[\log V_n] - \frac{\theta}{4} \mathrm{var}(\log V_n) + O(\theta^2).$$

By contrast, our formulation is able to handle both the case of risk-aversion and that of risk-seeking. The Taylor series expansion

$$\log E\Big[ e^{-\frac{\theta}{2} \log V_n} \Big] = -\frac{\theta}{2} E[\log V_n] + \frac{\theta^2}{8} \mathrm{var}(\log V_n) + o(\theta^2)$$

indicates that if the objective in (25) is multiplied by $-\frac{2}{\theta}$, then it corresponds to risk-aversion for positive $\theta$ and to risk-seeking for negative $\theta$.
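
The small-$\theta$ expansion above is easy to check numerically. The sketch below is only an illustration: the three-point distribution for $Y = \log V_n$ and the function names are assumptions made for the example. It compares $-\frac{2}{\theta} \log E[e^{-\frac{\theta}{2} Y}]$ with $E[Y] - \frac{\theta}{4} \mathrm{var}(Y)$ for small $\theta$ of both signs.

```python
import numpy as np

# Hypothetical three-point distribution for Y = log V_n.
y = np.array([-0.10, 0.05, 0.20])
prob = np.array([0.2, 0.5, 0.3])

mean = float(prob @ y)
var = float(prob @ (y - mean) ** 2)

def risk_adjusted(theta):
    """-(2/theta) log E[exp(-(theta/2) Y)]."""
    return float(-(2.0 / theta) * np.log(prob @ np.exp(-0.5 * theta * y)))

def second_order(theta):
    """E[Y] - (theta/4) var(Y)."""
    return mean - 0.25 * theta * var

# Gap between the exact quantity and its second-order approximation.
errors = {
    theta: abs(risk_adjusted(theta) - second_order(theta))
    for theta in (0.1, 0.01, -0.01, -0.1)
}
```

The exact quantity lies below $E[Y]$ for $\theta > 0$ (variance is penalized) and above it for $\theta < 0$ (variance is rewarded), which is the sense in which the sign of $\theta$ separates risk-aversion from risk-seeking.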

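Keeping in mind that $e^r + \langle a, W - e^r \mathbf{1} \rangle > 0$ under our assumption on the support of $\nu(x, y, dw)$, define

$$\mu(x, a, y) := \int e^{-\frac{\theta}{2} \log [e^r + \langle a, w - e^r \mathbf{1} \rangle]}\, \nu(x, y, dw) \quad \text{(assumed to be } < \infty\text{)},$$

$$\tilde r(x, a) := -\frac{2}{\theta} \log \Big( \sum_{y} q(y|x)\,\mu(x, a, y) \Big),$$

$$\tilde p(y|x, a) := \frac{q(y|x)\,\mu(x, a, y)}{\sum_{y'} q(y'|x)\,\mu(x, a, y')}.$$

One can show that for all $n \ge 1$ and all admissible controls, we have

$$\frac{1}{n} \log E\Big[ e^{-\frac{\theta}{2} \log V_n} \Big] = \frac{1}{n} \log \tilde E\Big[ e^{-\frac{\theta}{2} \sum_{m=0}^{n-1} \tilde r(X_m, \pi_m)} \Big],$$

where $\tilde E$ is the expectation with respect to the law

$$p(x_0)\,\phi_0(da_0|x_0)\,\tilde p(x_1|x_0, a_0)\,\phi_1(da_1|x_0, a_0, x_1) \cdots \phi_{n-1}(da_{n-1}|(x_i, a_i, 0 \le i \le n-2), x_{n-1}),$$

where $p(x_0)$ is the initial distribution of $X_0$, the admissible controls are determined by the kernels $\phi_0(\cdot|\cdot), \dots, \phi_{n-1}(\cdot|\cdot)$, and the salient point is that the transition kernel for the evolution of the factor process under this change of measure is given by the kernel $\tilde p(\cdot|\cdot, \cdot)$ defined above. To see this, first observe that $W_1, \dots, W_n$ are conditionally independent and identically distributed given $(X_i, 0 \le i \le n)$. Hence

$$\begin{aligned}
E\Big[ e^{-\frac{\theta}{2} \log V_n} \,\Big|\, X_i, \pi_i, 0 \le i \le n \Big]
&= E\Big[ \prod_{m=0}^{n-1} e^{-\frac{\theta}{2} \log (V_{m+1}/V_m)} \,\Big|\, X_i, \pi_i, 0 \le i \le n \Big] \\
&= \prod_{m=0}^{n-1} E\Big[ e^{-\frac{\theta}{2} \log [e^r + \langle \pi_m, W_{m+1} - e^r \mathbf{1} \rangle]} \,\Big|\, X_i, \pi_i, 0 \le i \le n \Big] \\
&= \prod_{m=0}^{n-1} \mu(X_m, \pi_m, X_{m+1}),
\end{aligned}$$

so we have

$$E\Big[ e^{-\frac{\theta}{2} \log V_n} \Big] = E\Big[ \prod_{m=0}^{n-1} \mu(X_m, \pi_m, X_{m+1}) \Big].$$

For an admissible control strategy, we can write this as

$$\sum_{x_0, \dots, x_n} \int_{a_0, \dots, a_{n-1}} p(x_0) \prod_{m=0}^{n-1} \mu(x_m, a_m, x_{m+1})\, q(x_{m+1}|x_m)\, \phi_m(da_m|(x_i, a_i, 0 \le i \le m-1), x_m),$$

which is the same as

$$\sum_{x_0, \dots, x_n} \int_{a_0, \dots, a_{n-1}} p(x_0) \prod_{m=0}^{n-1} e^{-\frac{\theta}{2} \tilde r(x_m, a_m)}\, \tilde p(x_{m+1}|x_m, a_m)\, \phi_m(da_m|(x_i, a_i, 0 \le i \le m-1), x_m),$$

which equals $\tilde E\Big[ e^{-\frac{\theta}{2} \sum_{m=0}^{n-1} \tilde r(X_m, \pi_m)} \Big]$.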
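
The change of measure above can be verified by exhaustive enumeration on a toy model. The following sketch rests on simplifying assumptions that are not in the text: two factor states, a single risky asset with two possible price relatives, a stationary deterministic strategy `policy`, and hypothetical numerical values throughout. It computes $E[e^{-\frac{\theta}{2} \log V_n}]$ by summing over all factor paths and price-relative outcomes, and compares it with $\tilde E[e^{-\frac{\theta}{2} \sum_{m<n} \tilde r(X_m, \pi_m)}]$ computed under the tilted kernel $\tilde p$.

```python
import itertools
import numpy as np

theta, r, n = 1.5, 0.02, 3                      # hypothetical parameter values
q = np.array([[0.7, 0.3], [0.4, 0.6]])          # factor chain transition matrix
p0 = np.array([0.5, 0.5])                       # initial distribution of X_0
w_vals = np.array([0.8, 1.3])                   # possible price relatives
nu = np.array([[[0.5, 0.5], [0.3, 0.7]],
               [[0.6, 0.4], [0.2, 0.8]]])       # nu[x, y, k] = P(W = w_vals[k] | x, y)
policy = np.array([0.4, 0.9])                   # stationary deterministic a(x)

def gross_return(a, w):
    """Per-period wealth multiplier e^r + a (w - e^r) for one risky asset."""
    return np.exp(r) + a * (w - np.exp(r))

# mu(x, a(x), y), r_tilde(x, a(x)) and p_tilde(y | x, a(x)).
mu = np.array([[nu[x, y] @ gross_return(policy[x], w_vals) ** (-theta / 2)
                for y in range(2)] for x in range(2)])
z = (q * mu).sum(axis=1)
r_tilde = -(2.0 / theta) * np.log(z)
p_tilde = q * mu / z[:, None]

# Left side: enumerate factor paths x_0..x_n and price relatives W_1..W_n.
lhs = 0.0
for xs in itertools.product(range(2), repeat=n + 1):
    for ks in itertools.product(range(2), repeat=n):
        prob, wealth = p0[xs[0]], 1.0
        for m in range(n):
            prob *= q[xs[m], xs[m + 1]] * nu[xs[m], xs[m + 1], ks[m]]
            wealth *= gross_return(policy[xs[m]], w_vals[ks[m]])
        lhs += prob * wealth ** (-theta / 2)

# Right side: the same quantity under the tilted kernel p_tilde.
rhs = 0.0
for xs in itertools.product(range(2), repeat=n + 1):
    prob = p0[xs[0]]
    for m in range(n):
        prob *= p_tilde[xs[m], xs[m + 1]]
    rhs += prob * np.exp(-(theta / 2) * sum(r_tilde[xs[m]] for m in range(n)))
```

The two sums agree up to rounding because $q(y|x)\,\mu(x, a, y) = e^{-\frac{\theta}{2} \tilde r(x, a)}\, \tilde p(y|x, a)$ term by term, which is exactly the substitution made between the last two displays.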


Hence the problem of maximizing (23) is equivalent to the risk-sensitive control problem for a controlled Markov chain on $\mathcal{Q}$ with action space $A$ and controlled transition probabilities $\tilde p(y|x, a)$, $x, y \in \mathcal{Q}$, $a \in A$, the objective being to maximize the reward

$$\lambda := \sup_{x \in \mathcal{Q}}\, \sup\, \liminf_{n \uparrow \infty} \frac{1}{n} \log \tilde E\Big[ e^{-\frac{\theta}{2} \sum_{m=0}^{n-1} \tilde r(X_m, \pi_m)} \,\Big|\, X_0 = x \Big],$$

where the second supremum is over admissible controls.

The optimal growth rate for the wealth is then given by

$$\lambda = \max_{\eta \in \mathcal{G}} \Big( \sum_{x} \int_A \hat\eta(x, da) \Big( -\frac{\theta}{2} \tilde r(x, a) - \sum_{y} \eta_2(y|x, a) \log \Big( \frac{\eta_2(y|x, a)}{\tilde p(y|x, a)} \Big) \Big) \Big),$$

where

$$\mathcal{G} := \Big\{ \eta(x, da, y) \in \mathcal{P}(\mathcal{Q} \times A \times \mathcal{Q}) : \eta(x, da, y) = \hat\eta(x, da)\,\eta_2(y|x, a) = \eta_0(x)\,\eta_1(da|x)\,\eta_2(y|x, a) \text{ such that } \eta_0 \text{ is stationary under the transition matrix } \int_A \eta_1(da|x)\,\eta_2(y|x, a) \Big\}.$$

In order to justify this, we need to verify that the conditions (A0) and (A1) are satisfied. Here $\mathcal{Q}$ plays the role of $S$, $A$ plays the role of $U$, and $-\frac{\theta}{2} \tilde r(x, a)$ plays the role of $r(x, u, y)$ in the general theory. The validity of (A0) follows from the continuity of the logarithm function. The validity of (A1) follows from the continuity of the logarithm function, the fact that $\mathcal{Q}$ is finite, and the fact that $\sum_{y'} q(y'|x)\,\mu(x, a, y')$ is strictly positive for all $(x, a)$.

If we discretize $A$, this is a finite dimensional concave maximization problem eminently amenable to standard nonlinear programming tools.


5.3 Minimizing exit rate from a domain 

Consider a set of controlled stochastic matrices on a finite state space $S = \{1, \dots, s\}$ denoted by $P_u = [[p(j|i, u)]]_{i, j \in S}$. Here $u$ is the control parameter taking values in $A$, where $A$ is a compact metric action space. We assume that $u \mapsto P_u$ is continuous and $P_u$ is irreducible for all $u$. Let $S_0 \subset S$ be a nonempty proper subset of $S$ and let $S_1 := S_0^c$ denote its complement. Let $\tilde P_u$ denote the restriction of $P_u$ to $S_1$ and for a sequence of random variables $\{X_n\}$ with values in $S$, define $\tau := \inf\{n \ge 0 : X_n \in S_0\}$.

We are interested in determining

$$\lambda := \sup_{i \in S_1}\, \sup\, \liminf_{n \uparrow \infty} \frac{1}{n} \log P(\tau > n),$$

where the second supremum is over all admissible controls, and the law of $\tau$ is determined by the control strategy. Namely, we are interested in the problem of finding the slowest exit rate from $S_1$ over admissible control strategies.

Write $\tilde P_u = D_u Q_u$ where $D_u$ is a diagonal matrix with its $i$th diagonal entry $d(i, u) := \sum_{j \in S_1} p(j|i, u)$ and $Q_u := [[q(j|i, u)]]$ is a stochastic matrix on $S_1$ given by $q(j|i, u) := d(i, u)^{-1} p(j|i, u)$, where we will also assume that $d(i, u) > 0$ for all $i \in S_1$ and $u \in A$. It can be checked that for any admissible control strategy and initial state $j \in S_1$, we have

$$P(\tau > n) = \tilde E\Big[ e^{\sum_{m=0}^{n-1} \log (d(X_m, U_m))} \Big],$$

where $U_m$ denotes the choice of control at time $m$, and $\{X_n\}$ is the $S_1$-valued Markov chain having the transition probability matrix $Q_{U_m}$ at time $m$. Therefore, with the choices $S := S_1$, $U := A$, and $r(i, u, j) := \log d(i, u)$, the problem is amenable to our general theory.
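
The survival identity can be checked by enumeration when the control is held fixed. This is a minimal sketch with hypothetical numbers: `P_restricted` plays the role of $\tilde P_u$ for one fixed $u$ on $S_1 = \{0, 1, 2\}$, and the missing mass in each row is the one-step probability of entering $S_0$.

```python
import itertools
import numpy as np

# Hypothetical restriction of P_u to S_1 for one fixed control u.
P_restricted = np.array([[0.5, 0.3, 0.1],
                         [0.2, 0.5, 0.2],
                         [0.1, 0.3, 0.4]])
d = P_restricted.sum(axis=1)        # d(i, u)
Q = P_restricted / d[:, None]       # stochastic matrix q(j | i, u) on S_1

n, start = 4, 0

# Left side: P(tau > n | X_0 = start) = (P_restricted^n 1)(start).
survival = float((np.linalg.matrix_power(P_restricted, n) @ np.ones(3))[start])

# Right side: expectation under Q of exp(sum_{m<n} log d(X_m)).
expectation = 0.0
for path in itertools.product(range(3), repeat=n):
    xs = (start,) + path
    prob = np.prod([Q[xs[m], xs[m + 1]] for m in range(n)])
    expectation += prob * np.exp(sum(np.log(d[xs[m]]) for m in range(n)))
```

Both quantities are sums over the same paths in $S_1$, with each step weight $p(x_{m+1}|x_m, u)$ factored as $d(x_m, u)\, q(x_{m+1}|x_m, u)$.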
Disintegrate a typical element $\eta \in \mathcal{P}(S_1 \times A \times S_1)$ as $\eta_0(i)\,\eta_1(du|i)\,\eta_2(j|i, u)$, and write $\hat\eta(i, du)$ for $\eta_0(i)\,\eta_1(du|i)$.

Then our results show that

$$\lambda = \max_{\eta \in \mathcal{G}} \Big( \sum_{i, j \in S_1} \int_A \eta(i, du, j) \log (d(i, u)) - \sum_{i \in S_1} \int_A \hat\eta(i, du)\, D\big(\eta_2(j|i, u) \,\big\|\, q(j|i, u)\big) \Big),$$

where $\mathcal{G}$ denotes the set of $\eta \in \mathcal{P}(S_1 \times A \times S_1)$ for which $\eta_0$ is invariant under the transition kernel $\int_A \eta_1(du|i)\,\eta_2(j|i, u)$. To verify this, we need to check the validity of the conditions (A0) and (A1). The former is a consequence of the assumed continuity of $u \mapsto P_u$. The latter is a consequence of the fact that $S_1$ is finite and that $u \mapsto Q_u$ is continuous, which in turn follows from the assumed continuity of $u \mapsto P_u$ and the assumption that $d(i, u) > 0$ for all $i \in S_1$ and $u \in A$.
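
With $A$ replaced by a finite grid, the variational formula can be checked against the nonlinear eigenvalue problem $\rho V(i) = \max_u \sum_{j \in S_1} p(j|i, u) V(j)$. The sketch below is an illustration under assumptions not made in the text: the two sub-stochastic matrices are hypothetical, the eigenpair is found by a nonlinear power iteration, and the candidate maximizer $\eta_2(j|i) \propto p(j|i, u^*(i))\, V(j)$ is a twisted kernel constructed for the example. It evaluates the objective at this candidate and compares the result with $\log \rho$.

```python
import numpy as np

# Hypothetical restrictions of P_u to S_1 = {0, 1, 2} for a two-point grid in A.
P = np.array([
    [[0.5, 0.3, 0.1], [0.2, 0.5, 0.2], [0.1, 0.3, 0.4]],     # u = 0
    [[0.3, 0.3, 0.3], [0.3, 0.3, 0.1], [0.3, 0.3, 0.35]],    # u = 1
])
d = P.sum(axis=2)                 # d[u, i]
q = P / d[:, :, None]             # q[u, i, j]
states = np.arange(3)

# Nonlinear power iteration for rho V(i) = max_u sum_j p(j | i, u) V(j).
V = np.ones(3) / 3
for _ in range(500):
    g = np.max(P @ V, axis=0)
    V = g / g.sum()
candidates = P @ V                # shape (num_actions, num_states)
u_star = np.argmax(candidates, axis=0)
g = candidates[u_star, states]
rho = float(g.sum() / V.sum())

# Candidate maximizer: twisted kernel eta_2 and its invariant distribution eta_0.
eta2 = P[u_star, states] * V[None, :] / g[:, None]
system = np.vstack([eta2.T - np.eye(3), np.ones(3)])
eta0 = np.linalg.lstsq(system, np.array([0.0, 0.0, 0.0, 1.0]), rcond=None)[0]

# Objective: sum_i eta_0(i) [ log d(i, u*) - D(eta_2(.|i) || q(.|i, u*)) ].
q_star = q[u_star, states]
d_star = d[u_star, states]
kl = np.sum(eta2 * np.log(eta2 / q_star), axis=1)
objective = float(eta0 @ (np.log(d_star) - kl))

# Spectral radius of each fixed-control restriction, for comparison.
fixed_control_radii = [float(np.max(np.abs(np.linalg.eigvals(P[u])))) for u in range(2)]
```

Here `objective` matches `np.log(rho)`, and `rho` is at least as large as the spectral radius of either fixed-control matrix, so mixing the two controls across states cannot slow the exit less than the better single control.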


6 Concluding remarks 

We considered the problem of maximizing the growth rate of reward in the standard risk-sensitive formulation for a controlled Markov chain on a compact metric state space, with a compact metric action space. We took a non-standard approach to this problem via a nonlinear version of the Krein-Rutman theorem to obtain a variational formulation for the optimal reward. This leads to an occupation measure based concave maximization formulation of the control problem.

The approach holds promise for possible use of convex optimization techniques for approximate solution of the risk-sensitive reward maximization problem, in a manner analogous to what abstract linear programming does for the classical additive reward problems (such as discounted or ergodic rewards, see, e.g., [25]). We achieved this with rather few technical conditions except for the compactness of the state and action spaces. It remains a major challenge to extend this approach to noncompact state and action spaces.